Showing papers in "Genome Biology in 2002"


Journal ArticleDOI
TL;DR: The normalization strategy presented here is a prerequisite for accurate RT-PCR expression profiling, which opens up the possibility of studying the biological relevance of small expression differences.
Abstract: Gene-expression analysis is increasingly important in biological research, with real-time reverse transcription PCR (RT-PCR) becoming the method of choice for high-throughput and accurate expression profiling of selected genes. Given the increased sensitivity, reproducibility and large dynamic range of this methodology, the requirements for a proper internal control gene for normalization have become increasingly stringent. Although housekeeping gene expression has been reported to vary considerably, no systematic survey has properly determined the errors related to the common practice of using only one control gene, nor presented an adequate way of working around this problem. We outline a robust and innovative strategy to identify the most stably expressed control genes in a given set of tissues, and to determine the minimum number of genes required to calculate a reliable normalization factor. We have evaluated ten housekeeping genes from different abundance and functional classes in various human tissues, and demonstrated that the conventional use of a single gene for normalization leads to relatively large errors in a significant proportion of samples tested. The geometric mean of multiple carefully selected housekeeping genes was validated as an accurate normalization factor by analyzing publicly available microarray data. The normalization strategy presented here is a prerequisite for accurate RT-PCR expression profiling, which, among other things, opens up the possibility of studying the biological relevance of small expression differences.

18,261 citations
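
The abstract above proposes the geometric mean of several control genes as a normalization factor. A minimal sketch of that arithmetic follows, assuming each sample already has relative expression levels for its control genes. The function names `normalization_factor` and `normalize` are hypothetical, and the paper's procedure for selecting the most stable genes is not reproduced here.

```python
import math


def normalization_factor(control_levels):
    """Geometric mean of the control-gene expression levels for one sample.

    `control_levels` is assumed to hold strictly positive relative
    quantities, one per control gene (hypothetical input format).
    """
    if not control_levels:
        raise ValueError("at least one control gene is required")
    if any(level <= 0 for level in control_levels):
        raise ValueError("expression levels must be positive")
    # Averaging in log space is the geometric mean and avoids overflow.
    log_mean = sum(math.log(level) for level in control_levels) / len(control_levels)
    return math.exp(log_mean)


def normalize(target_level, control_levels):
    """Divide a target gene's level by the sample's normalization factor."""
    return target_level / normalization_factor(control_levels)
```

Compared with an arithmetic mean, the geometric mean is less affected by a single control gene with an outlying level, which may be one reason to prefer it when abundances differ widely between genes.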


Journal ArticleDOI
TL;DR: Gene expression is finely regulated at the post-transcriptional level through the regulation of stem-loop structures, upstream initiation codons and open reading frames, internal ribosome entry sites and various cis-acting elements that are bound by RNA-binding proteins.
Abstract: Gene expression is finely regulated at the post-transcriptional level. Features of the untranslated regions of mRNAs that control their translation, degradation and localization include stem-loop structures, upstream initiation codons and open reading frames, internal ribosome entry sites and various cis-acting elements that are bound by RNA-binding proteins.

966 citations


Journal ArticleDOI
TL;DR: Understanding of prokaryote biology from study of pure cultures and genome sequencing has been limited by a pronounced sampling bias towards four bacterial phyla - Proteobacteria, Firmicutes, Actinobacteria and Bacteroidetes.
Abstract: Our understanding of prokaryote biology from study of pure cultures and genome sequencing has been limited by a pronounced sampling bias towards four bacterial phyla - Proteobacteria, Firmicutes, Actinobacteria and Bacteroidetes - out of 35 bacterial and 18 archaeal phylum-level lineages. This bias is beginning to be rectified by the use of phylogenetically directed isolation strategies and by directly accessing microbial genomes from environmental samples.

878 citations


Journal ArticleDOI
TL;DR: The soluble glutathione transferases (GSTs) represent an excellent example of how protein families can diversify to fulfill multiple functions while conserving form and structure.
Abstract: The soluble glutathione transferases (GSTs, EC 2.5.1.18) are encoded by a large and diverse gene family in plants, which can be divided on the basis of sequence identity into the phi, tau, theta, zeta and lambda classes. The theta and zeta GSTs have counterparts in animals but the other classes are plant-specific and form the focus of this article. The genome of Arabidopsis thaliana contains 48 GST genes, with the tau and phi classes being the most numerous. The GST proteins have evolved by gene duplication to perform a range of functional roles using the tripeptide glutathione (GSH) as a cosubstrate or coenzyme. GSTs are predominantly expressed in the cytosol, where their GSH-dependent catalytic functions include the conjugation and resulting detoxification of herbicides, the reduction of organic hydroperoxides formed during oxidative stress and the isomerization of maleylacetoacetate to fumarylacetoacetate, a key step in the catabolism of tyrosine. GSTs also have non-catalytic roles, binding flavonoid natural products in the cytosol prior to their deposition in the vacuole. Recent studies have also implicated GSTs as components of ultraviolet-inducible cell signaling pathways and as potential regulators of apoptosis. Although sequence diversification has produced GSTs with multiple functions, the structure of these proteins has been highly conserved. The GSTs thus represent an excellent example of how protein families can diversify to fulfill multiple functions while conserving form and structure.

820 citations


Journal ArticleDOI
TL;DR: An epidemiologic perspective on the issue of human categorization in biomedical and genetic research that strongly supports the continued use of self-identified race and ethnicity is provided.
Abstract: A debate has arisen regarding the validity of racial/ethnic categories for biomedical and genetic research. Some claim 'no biological basis for race' while others advocate a 'race-neutral' approach, using genetic clustering rather than self-identified ethnicity for human genetic categorization. We provide an epidemiologic perspective on the issue of human categorization in biomedical and genetic research that strongly supports the continued use of self-identified race and ethnicity.

770 citations


Journal ArticleDOI
TL;DR: Analyzing gene-expression patterns by in situ hybridization to whole-mount embryos provides an extremely rich dataset that can be used to identify genes involved in developmental processes that have been missed by traditional genetic analysis.
Abstract: Background: Cell-fate specification and tissue differentiation during development are largely achieved by the regulation of gene transcription. Results: As a first step to creating a comprehensive atlas of gene-expression patterns during Drosophila embryogenesis, we examined 2,179 genes by in situ hybridization to fixed Drosophila embryos. Of the genes assayed, 63.7% displayed dynamic expression patterns that were documented with 25,690 digital photomicrographs of individual embryos. The photomicrographs were annotated using controlled vocabularies for anatomical structures that are organized into a developmental hierarchy. We also generated a detailed time course of gene expression during embryogenesis using microarrays to provide an independent corroboration of the in situ hybridization results. All image, annotation and microarray data are stored in a publicly available database. We found that the RNA transcripts of about 1% of genes show clear subcellular localization. Nearly all the annotated expression patterns are distinct. We present an approach for organizing the data by hierarchical clustering of annotation terms that allows us to group tissues that express similar sets of genes as well as genes displaying similar expression patterns. Conclusions: Analyzing gene-expression patterns by in situ hybridization to whole-mount embryos provides an extremely rich dataset that can be used to identify genes involved in developmental processes that have been missed by traditional genetic analysis. Systematic analysis of rigorously annotated patterns of gene expression will complement and extend the types of analyses carried out using expression microarrays.

740 citations


Journal ArticleDOI
TL;DR: A new prediction-based resampling method, Clest, is developed, to estimate the number of clusters in a dataset, and was generally found to be more accurate and robust than the six existing methods considered in the study.
Abstract: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems. We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study. Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.

728 citations


Journal ArticleDOI
TL;DR: It is hypothesized that gene duplications that persist in an evolving lineage are beneficial from the time of their origin, due primarily to a protein dosage effect in response to variable environmental conditions; duplications are likely to give rise to new functions at a later phase of their evolution once a higher level of divergence is reached.
Abstract: Background: Gene duplications have a major role in the evolution of new biological functions. Theoretical studies often assume that a duplication per se is selectively neutral and that, following a duplication, one of the gene copies is freed from purifying (stabilizing) selection, which creates the potential for evolution of a new function.

725 citations


Journal ArticleDOI
TL;DR: This analysis represents an initial characterization of the transposable elements in the Release 3 euchromatic genomic sequence of D. melanogaster for which comparison to the transposable elements of other organisms can begin to be made.
Abstract: Transposable elements are found in the genomes of nearly all eukaryotes. The recent completion of the Release 3 euchromatic genomic sequence of Drosophila melanogaster by the Berkeley Drosophila Genome Project has provided precise sequence for the repetitive elements in the Drosophila euchromatin. We have used this genomic sequence to describe the euchromatic transposable elements in the sequenced strain of this species. We identified 85 known and eight novel families of transposable element varying in copy number from one to 146. A total of 1,572 full and partial transposable elements were identified, comprising 3.86% of the sequence. More than two-thirds of the transposable elements are partial. The density of transposable elements increases an average of 4.7 times in the centromere-proximal regions of each of the major chromosome arms. We found that transposable elements are preferentially found outside genes; only 436 of 1,572 transposable elements are contained within the 61.4 Mb of sequence that is annotated as being transcribed. A large proportion of transposable elements is found nested within other elements of the same or different classes. Lastly, an analysis of structural variation from different families reveals distinct patterns of deletion for elements belonging to different classes. This analysis represents an initial characterization of the transposable elements in the Release 3 euchromatic genomic sequence of D. melanogaster for which comparison to the transposable elements of other organisms can begin to be made. These data have been made available on the Berkeley Drosophila Genome Project website for future analyses.

593 citations


Journal ArticleDOI
TL;DR: A simple and robust non-linear method for normalization using array signal distribution analysis and cubic splines is presented and it is shown that intensity-dependent normalization is important for both high-density oligonucleotide array and cDNA array data.
Abstract: Microarray data are subject to multiple sources of variation, of which biological sources are of interest whereas most others are only confounding. Recent work has identified systematic sources of variation that are intensity-dependent and non-linear in nature. Systematic sources of variation are not limited to the differing properties of the cyanine dyes Cy5 and Cy3 as observed in cDNA arrays, but are the general case for both oligonucleotide microarray (Affymetrix GeneChips) and cDNA microarray data. Current normalization techniques are most often linear and therefore not capable of fully correcting for these effects. We present here a simple and robust non-linear method for normalization using array signal distribution analysis and cubic splines. These methods compared favorably to normalization using robust local-linear regression (lowess). The application of these methods to oligonucleotide arrays reduced the relative error between replicates by 5-10% compared with a standard global normalization method. Application to cDNA arrays showed improvements over the standard method and over Cy3-Cy5 normalization based on dye-swap replication. In addition, a set of known differentially regulated genes was ranked higher by the t-test. In either cDNA or Affymetrix technology, signal-dependent bias was more than ten times greater than the observed print-tip or spatial effects. Intensity-dependent normalization is important for both high-density oligonucleotide array and cDNA array data. Both the regression and spline-based methods described here performed better than existing linear methods when assessed on the variability of replicate arrays. Dye-swap normalization was less effective at Cy3-Cy5 normalization than either regression or spline-based methods alone.

553 citations


Journal ArticleDOI
TL;DR: Fuzzy k-means clustering is a useful analytical tool for extracting biological insights from gene-expression data and suggests that a prevalent theme in the regulation of yeast gene expression is the condition-specific coregulation of overlapping sets of genes.
Abstract: Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed. We used a heuristically modified version of fuzzy k-means clustering to identify overlapping clusters of yeast genes based on published gene-expression data following the response of yeast cells to environmental changes. We have validated the method by identifying groups of functionally related and coregulated genes, and in the process we have uncovered new correlations between yeast genes and between the experimental conditions based on similarities in gene-expression patterns. To investigate the regulation of gene expression, we correlated the clusters with known transcription factor binding sites present in the genes' promoters. These results give insights into the mechanism of the regulation of gene expression in yeast cells responding to environmental changes. Fuzzy k-means clustering is a useful analytical tool for extracting biological insights from gene-expression data. Our analysis presented here suggests that a prevalent theme in the regulation of yeast gene expression is the condition-specific coregulation of overlapping sets of genes.
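
The overlapping clusters described above come from soft membership: each gene belongs to every cluster to some degree. As a hedged illustration, the sketch below computes the textbook fuzzy k-means (fuzzy c-means) membership update for one expression profile. It is not the heuristically modified version used in the paper, and the function name `memberships` and the fuzziness parameter `m` are assumptions for this example.

```python
import math


def memberships(profile, centroids, m=2.0):
    """Standard fuzzy c-means membership of one profile in each cluster.

    `profile` and each centroid are equal-length sequences of numbers;
    `m` > 1 is the fuzziness exponent (m = 2 is a common default).
    Returns one membership per centroid; the values sum to 1.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    dists = [math.dist(profile, c) for c in centroids]
    # A profile sitting exactly on one or more centroids belongs fully to them.
    on_centroid = [i for i, d in enumerate(dists) if d == 0.0]
    if on_centroid:
        share = 1.0 / len(on_centroid)
        return [share if i in on_centroid else 0.0 for i in range(len(dists))]
    # Membership is proportional to distance ** (-2 / (m - 1)).
    exponent = 2.0 / (m - 1.0)
    weights = [d ** -exponent for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

A gene whose profile lies between two centroids receives substantial membership in both, which is how a soft method can represent a gene that is coregulated with different groups under different conditions.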

Journal ArticleDOI
TL;DR: This work presents a web-based customizable bioinformatics solution called BioArray Software Environment (BASE) for the management and analysis of all areas of microarray experimentation.
Abstract: The microarray technique requires the organization and analysis of vast amounts of data. These data include information about the samples hybridized, the hybridization images and their extracted data matrices, and information about the physical array, the features and reporter molecules. We present a web-based customizable bioinformatics solution called BioArray Software Environment (BASE) for the management and analysis of all areas of microarray experimentation. All software necessary to run a local server is freely available.

Journal ArticleDOI
TL;DR: MAGE will help microarray data producers and users to exchange information by providing a common platform for data exchange, and MAGE-STK will make the adoption of MAGE easier.
Abstract: Background: Meaningful exchange of microarray data is currently difficult because it is rare that published data provide sufficient information depth or are even in the same format from one publication to another. Only when data can be easily exchanged will the entire biological community be able to derive the full benefit from such microarray studies.

Journal ArticleDOI
TL;DR: This work identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution and identified several new motifs enriched in promoter regions.
Abstract: Background: The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools. Results: We identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates. Conclusions: There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.

Journal ArticleDOI
TL;DR: FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.
Abstract: The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.

Journal ArticleDOI
TL;DR: The WGS strategy can efficiently produce a high-quality sequence of a metazoan genome while generating the reagents required for sequence finishing; however, the initial method of repeat assembly was flawed.
Abstract: Background: The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions.

Journal ArticleDOI
TL;DR: Histidine protein kinases are a large family of signal-transduction enzymes that autophosphorylate on a conserved histidine residue and are important for multiple functions in bacteria, including chemotaxis and quorum sensing, and in eukaryotes, including hormone-dependent developmental processes.
Abstract: Histidine protein kinases (HPKs) are a large family of signal-transduction enzymes that autophosphorylate on a conserved histidine residue. HPKs form two-component signaling systems together with their downstream target proteins, the response regulators, which have a conserved aspartate in a so-called 'receiver domain' that is phosphorylated by the HPK. Two-component signal transduction is prevalent in bacteria and is also widely used by eukaryotes outside the animal kingdom. The typical HPK is a transmembrane receptor with an amino-terminal extracellular sensing domain and a carboxy-terminal cytosolic signaling domain; most, if not all, HPKs function as dimers. They show little similarity to protein kinases that phosphorylate serine, threonine or tyrosine residues, but may share a distant evolutionary relationship with these enzymes. In excess of a thousand known genes encode HPKs, which are important for multiple functions in bacteria, including chemotaxis and quorum sensing, and in eukaryotes, including hormone-dependent developmental processes. The proteins divide into at least 11 subfamilies, only one of which is present in eukaryotes, suggesting that lateral gene transfer gave rise to two-component signaling in these organisms.

Journal ArticleDOI
TL;DR: Identification of so many unusual gene models in Drosophila suggests that some mechanisms for gene regulation are more prevalent than previously believed, and underscores the complex challenges of eukaryotic gene prediction.
Abstract: Background: The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences. Results: Although the number of predicted protein-coding genes in Drosophila remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequence provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes. Conclusions: Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations.

Journal ArticleDOI
TL;DR: Analysis of cell-line samples can identify systematic structure in measured gene-expression levels and shows how pooling a small number of samples with a diverse representation of expressed genes can outperform more complex mixtures as a reference sample.
Abstract: Background: 'Fold-change' cutoffs have been widely used in microarray assays to identify genes that are differentially expressed between query and reference samples. More accurate measures of differential expression and effective data-normalization strategies are required to identify high-confidence sets of genes with biologically meaningful changes in transcription. Further, the analysis of a large number of expression profiles is facilitated by a common reference sample, the construction of which must be carefully addressed.

Journal ArticleDOI
TL;DR: An emerging family of structurally distinct dual-specificity serine, threonine and tyrosine phosphatases that act on MAP kinases consists of ten members in mammals, and members have been found in animals, plants and yeast.
Abstract: Mitogen-activated protein (MAP) kinases are key signal-transducing enzymes that are activated by a wide range of extracellular stimuli. They are responsible for the induction of a number of cellular responses, such as changes in gene expression, proliferation, differentiation, cell cycle arrest and apoptosis. Although regulation of MAP kinases by a phosphorylation cascade has long been recognized as significant, their inactivation through the action of specific phosphatases has been less studied. An emerging family of structurally distinct dual-specificity serine, threonine and tyrosine phosphatases that act on MAP kinases consists of ten members in mammals, and members have been found in animals, plants and yeast. Three subgroups have been identified that differ in exon structure, sequence and substrate specificity.

Journal ArticleDOI
TL;DR: Phylogenetic analysis of the ADF/cofilins reveals that, with few exceptions, their relationships reflect conventional views of the relationships between the major groups of organisms.
Abstract: The ADF/cofilins are a family of actin-binding proteins expressed in all eukaryotic cells so far examined. Members of this family remodel the actin cytoskeleton, for example during cytokinesis, when the actin-rich contractile ring shrinks as it contracts through the interaction of ADF/cofilins with both monomeric and filamentous actin. The depolymerizing activity is twofold: ADF/cofilins sever actin filaments and also increase the rate at which monomers leave the filament's pointed end. The three-dimensional structure of ADF/cofilins is similar to a fold in members of the gelsolin family of actin-binding proteins in which this fold is typically repeated three or six times; although both families bind polyphosphoinositide lipids and actin in a pH-dependent manner, they share no obvious sequence similarity. Plants and animals have multiple ADF/cofilin genes, belonging in vertebrates to two types, ADF and cofilins. Other eukaryotes (such as yeast, Acanthamoeba and slime moulds) have a single ADF/cofilin gene. Phylogenetic analysis of the ADF/cofilins reveals that, with few exceptions, their relationships reflect conventional views of the relationships between the major groups of organisms.

Journal ArticleDOI
TL;DR: New methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues, are presented, based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes.
Abstract: Background: Methods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared the performance relative to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier.

Journal ArticleDOI
TL;DR: Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of the cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm.
Abstract: Background: Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly. Results: WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm. Conclusions: Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.

Journal ArticleDOI
TL;DR: The analyses suggest that the Ca2+ messenger system is widely used in plants and that EF-hand-containing proteins are likely to be the key transducers mediating Ca2+ action.
Abstract: In plants, calcium (Ca2+) has emerged as an important messenger mediating the action of many hormonal and environmental signals, including biotic and abiotic stresses. Many different signals raise cytosolic calcium concentration ([Ca2+]cyt), which in turn is thought to regulate cellular and developmental processes via Ca2+-binding proteins. Three out of the four classes of Ca2+-binding proteins in plants contain Ca2+-binding EF-hand motif(s). This motif is a conserved helix-loop-helix structure that can bind a single Ca2+ ion. To identify all EF-hand-containing proteins in Arabidopsis, we analyzed its completed genome sequence for genes encoding EF-hand-containing proteins. A maximum of 250 proteins possibly having EF-hands were identified. Diverse proteins, including enzymes, proteins involved in transcription and translation, protein- and nucleic-acid-binding proteins and a large number of unknown proteins, have one or more putative EF-hands. Phylogenetic analysis identified six major groups that contain some families of proteins. The presence of EF-hand motif(s) in a diversity of proteins is consistent with the involvement of Ca2+ in regulating many cellular and developmental processes. Thus far, only 47 of the possible 250 EF-hand proteins have been reported in the literature. Various domains that we identified in many of the uncharacterized EF-hand-containing proteins should help in elucidating their cellular role(s). Our analyses suggest that the Ca2+ messenger system is widely used in plants and that EF-hand-containing proteins are likely to be the key transducers mediating Ca2+ action.

Journal ArticleDOI
TL;DR: Amino-terminal acetylation occurs on the bulk of eukaryotic proteins and on regulatory peptides, whereas lysine acetylation occurs at different positions on a variety of proteins, including histones, transcription factors, nuclear import factors, and α-tubulin.
Abstract: Acetylation of proteins, either on various amino-terminal residues or on the ε-amino group of lysine residues, is catalyzed by a wide range of acetyltransferases. Amino-terminal acetylation occurs on the bulk of eukaryotic proteins and on regulatory peptides, whereas lysine acetylation occurs at different positions on a variety of proteins, including histones, transcription factors, nuclear import factors, and α-tubulin.

Journal ArticleDOI
TL;DR: The regulatory elements that direct alternative splicing and how genome-wide analyses can aid in their identification are discussed.
Abstract: Alternative splicing of pre-mRNAs is central to the generation of diversity from the relatively small number of genes in metazoan genomes. Auxiliary cis elements and trans-acting factors are required for the recognition of constitutive and alternatively spliced exons and their inclusion in pre-mRNA. Here, we discuss the regulatory elements that direct alternative splicing and how genome-wide analyses can aid in their identification.

Journal ArticleDOI
TL;DR: The authors' trees show the overall relationship of 277 GPCRs with emphasis on orphan receptors, and may prove valuable for identification of the natural ligands of orphan receptors as their relation to receptors with known ligands becomes more evident.
Abstract: G-protein-coupled receptors (GPCRs) are the largest and most diverse family of transmembrane receptors. They respond to a wide range of stimuli, including small peptides, lipid analogs, amino-acid derivatives, and sensory stimuli such as light, taste and odor, and transmit signals to the interior of the cell through interaction with heterotrimeric G proteins. A large number of putative GPCRs have no identified natural ligand. We hypothesized that a more complete knowledge of the phylogenetic relationship of these orphan receptors to receptors with known ligands could facilitate ligand identification, as related receptors often have ligands with similar structural features. A database search excluding olfactory and gustatory receptors was used to compile a list of accession numbers and synonyms of 81 orphan and 196 human GPCRs with known ligands. Of these, 241 sequences belonging to the rhodopsin receptor-like family A were aligned and a tentative phylogenetic tree constructed by neighbor joining. This tree and local alignment tools were used to define 19 subgroups of family A small enough for more accurate maximum-likelihood analyses. The secretin receptor-like family B and metabotropic glutamate receptor-like family C were directly subjected to these methods. Our trees show the overall relationship of 277 GPCRs with emphasis on orphan receptors. Support values are given for each branch. This approach may prove valuable for identification of the natural ligands of orphan receptors as their relation to receptors with known ligands becomes more evident.

Journal ArticleDOI
TL;DR: A detailed analysis of the anatomy and distribution of L1 elements in the human genome using a new computer program, TSDfinder, designed to identify transposon boundaries precisely found no correlation between the composition and genomic location of the pre-insertion locus and the resulting anatomy of the L1 insertion.
Abstract: As the rough draft of the human genome sequence nears a finished product and other genome-sequencing projects accumulate sequence data exponentially, bioinformatics is emerging as an important tool for studies of transposon biology. In particular, L1 elements exhibit a variety of sequence structures after insertion into the human genome that are amenable to computational analysis. We carried out a detailed analysis of the anatomy and distribution of L1 elements in the human genome using a new computer program, TSDfinder, designed to identify transposon boundaries precisely. Structural variants of L1 elements shared similar trends in the length and quality of their target site duplications (TSDs) and poly(A) tails. Furthermore, we found no correlation between the composition and genomic location of the pre-insertion locus and the resulting anatomy of the L1 insertion. We verified that L1 insertions with TSDs have the 5'-TTAAAA-3' cleavage site associated with L1 endonuclease activity. In addition, the second target DNA cut required for L1 insertion weakly matches the consensus pattern TTAAAA. On the other hand, the L1-internal breakpoints of deleted and inverted L1 elements do not resemble L1 endonuclease cleavage sites. Finally, the genome sequence data indicate that whereas singly inverted elements are common, doubly inverted elements are almost never found. The sequence data give no indication that the creation of L1 structural variants depends on characteristics of the insertion locus. In addition, the formation of 5' truncated and 5' inverted L1s is probably not due to the action of the L1 endonuclease.
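
As a rough illustration of the kind of motif check the abstract mentions, the sketch below scans a DNA string for the 5'-TTAAAA-3' consensus on the given strand and on its reverse complement. This is not TSDfinder and does not reproduce its boundary-detection logic; the function names `find_motif` and `reverse_complement` and the example sequence are hypothetical, chosen only for illustration.

```python
def find_motif(sequence, motif="TTAAAA"):
    """Return 0-based start positions of every (possibly overlapping) exact motif match."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping matches are also reported.
        start = sequence.find(motif, start + 1)
    return positions


def reverse_complement(sequence):
    """Return the reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    table = str.maketrans("ACGT", "TGCA")
    return sequence.upper().translate(table)[::-1]


# Made-up example sequence, not taken from any genome.
example = "ACGTTAAAAGT"
plus_hits = find_motif(example)
minus_hits = find_motif(reverse_complement(example))
print(plus_hits, minus_hits)
```

Exact matching is the simplest choice here. Because the abstract describes the second cut as only weakly matching the consensus, a realistic analysis would presumably need a tolerant score (for example, a mismatch count) rather than exact string search.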

Journal ArticleDOI
TL;DR: It is suggested that the diversification of bHLH genes is directly linked to the acquisition of multicellularity, and that important diversification of the bHLH repertoire occurred independently in animals and plants.
Abstract: The basic helix-loop-helix (bHLH) proteins are a large and complex multigene family of transcription factors with important roles in animal development, including that of fruitflies, nematodes and vertebrates. The identification of orthologous relationships among the bHLH genes from these widely divergent taxa allows reconstruction of the putative complement of bHLH genes present in the genome of their last common ancestor. We identified 39 different bHLH genes in the worm Caenorhabditis elegans, 58 in the fly Drosophila melanogaster and 125 in human (Homo sapiens). We defined 44 orthologous families that include most of these bHLH genes. Of these, 43 include both human and fly and/or worm genes, indicating that genes from these families were already present in the last common ancestor of worm, fly and human. Only two families contain both yeast and animal genes, and no family contains both plant and animal bHLH genes. We suggest that the diversification of bHLH genes is directly linked to the acquisition of multicellularity, and that important diversification of the bHLH repertoire occurred independently in animals and plants. As the last common ancestor of worm, fly and human is also that of all bilaterian animals, our analysis indicates that this ancient ancestor must have possessed at least 43 different types of bHLH, highlighting its genomic complexity.

Journal ArticleDOI
TL;DR: It is shown that ARD is a promising technique that is complementary to existing gene-expression clustering techniques; it was validated on freely available human serial analysis of gene expression (SAGE) data.
Abstract: BACKGROUND: The association-rules discovery (ARD) technique has yet to be applied to gene-expression data analysis. Even in the absence of previous biological knowledge, it should identify sets of genes whose expression is correlated. The first association-rule miners appeared six years ago and proved efficient at dealing with sparse and weakly correlated data. A huge international research effort has led to new algorithms for tackling difficult contexts and these are particularly suited to analysis of large gene-expression matrices. To validate the ARD technique we have applied it to freely available human serial analysis of gene expression (SAGE) data. RESULTS: The approach described here enables us to designate sets of strong association rules. We normalized the SAGE data before applying our association rule miner. Depending on the discretization algorithm used, different properties of the data were highlighted. Both common and specific interpretations could be made from the extracted rules. In each and every case the extracted collections of rules indicated that a very strong co-regulation of mRNA encoding ribosomal proteins occurs in the dataset. Several rules associating proteins involved in signal transduction were obtained and analyzed, some pointing to yet-unexplored directions. Furthermore, by examining a subset of these rules, we were able both to reassign a wrongly labeled tag, and to propose a function for an expressed sequence tag encoding a protein of unknown function. CONCLUSIONS: We show that ARD is a promising technique that turns out to be complementary to existing gene-expression clustering techniques.
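
The general idea of association-rule mining on discretized expression data can be sketched in a few lines. The example below is a minimal sketch under stated assumptions, not the miner or the discretization algorithms used in the paper. It assumes a simple fixed-threshold discretization, restricts itself to rules between pairs of genes, and uses the standard support and confidence measures. The gene labels (`ribo1`, `ribo2`, `other`), the counts, and the thresholds are all made-up toy values.

```python
from itertools import combinations


def discretize(libraries, threshold):
    """Map each library to the set of genes whose count is at or above the threshold."""
    return [{gene for gene, count in lib.items() if count >= threshold}
            for lib in libraries]


def support(transactions, itemset):
    """Fraction of transactions that contain every gene in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for every gene pair rule passing both cutoffs."""
    genes = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(genes, 2):
        s_ab = support(transactions, {a, b})
        if s_ab < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s_ab / support(transactions, {lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, s_ab, confidence))
    return rules


# Toy tag counts for four hypothetical libraries (illustrative values only).
libraries = [
    {"ribo1": 50, "ribo2": 40, "other": 1},
    {"ribo1": 60, "ribo2": 55, "other": 20},
    {"ribo1": 45, "ribo2": 30, "other": 0},
    {"ribo1": 2, "ribo2": 3, "other": 25},
]
transactions = discretize(libraries, threshold=10)
for lhs, rhs, s, c in pair_rules(transactions, min_support=0.5, min_confidence=0.9):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

As the abstract notes, the choice of discretization changes which properties of the data are highlighted, so the fixed threshold here is only one of many possible choices.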