scispace - formally typeset
Search or ask a question

Showing papers in "Nucleic Acids Research in 2009"


Journal ArticleDOI
TL;DR: The survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.
Abstract: Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.

13,102 citations


Journal ArticleDOI
TL;DR: The popular MEME motif discovery algorithm is now complemented by the GLAM2 algorithm which allows discovery of motifs containing gaps, and all of the motif-based tools are now implemented as web services via Opal.
Abstract: The MEME Suite web server provides a unified portal for online discovery and analysis of sequence motifs representing features such as DNA binding sites and protein interaction domains. The popular MEME motif discovery algorithm is now complemented by the GLAM2 algorithm which allows discovery of motifs containing gaps. Three sequence scanning algorithms—MAST, FIMO and GLAM2SCAN—allow scanning numerous DNA and protein sequence databases for motifs discovered by MEME and GLAM2. Transcription factor motifs (including those discovered using MEME) can be compared with motifs in many popular motif databases using the motif database scanning algorithm Tomtom. Transcription factor motifs can be further analyzed for putative function by association with Gene Ontology (GO) terms using the motif-GO term association tool GOMO. MEME output now contains sequence LOGOS for each discovered motif, as well as buttons to allow motifs to be conveniently submitted to the sequence and motif database scanning algorithms (MAST, FIMO and Tomtom), or to GOMO, for further analysis. GLAM2 output similarly contains buttons for further analysis using GLAM2SCAN and for rerunning GLAM2 with different parameters. All of the motif-based tools are now implemented as web services via Opal. Source code, binaries and a web server are freely available for noncommercial use at http://meme.nbcr.net.

7,733 citations


Journal ArticleDOI
TL;DR: The Carbohydrate-Active Enzyme (CAZy) database is a knowledge-based resource specialized in the enzymes that build and breakdown complex carbohydrates and glycoconjugates and has been used to improve the quality of functional predictions of a number genome projects by providing expert annotation.
Abstract: The Carbohydrate-Active Enzyme (CAZy) database is a knowledge-based resource specialized in the enzymes that build and breakdown complex carbohydrates and glycoconjugates. As of September 2008, the database describes the present knowledge on 113 glycoside hydrolase, 91 glycosyltransferase, 19 polysaccharide lyase, 15 carbohydrate esterase and 52 carbohydrate-binding module families. These families are created based on experimentally characterized proteins and are populated by sequences from public databases with significant similarity. Protein biochemical information is continuously curated based on the available literature and structural information. Over 6400 proteins have assigned EC numbers and 700 proteins have a PDB structure. The classification (i) reflects the structural features of these enzymes better than their sole substrate specificity, (ii) helps to reveal the evolutionary relationships between these enzymes and (iii) provides a convenient framework to understand mechanistic properties. This resource has been available for over 10 years to the scientific community, contributing to information dissemination and providing a transversal nomenclature to glycobiologists. More recently, this resource has been used to improve the quality of functional predictions of a number genome projects by providing expert annotation. The CAZy resource resides at URL: http://www.cazy.org/.

6,028 citations


Journal ArticleDOI
TL;DR: An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences, and a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data.
Abstract: The Ribosomal Database Project (RDP) provides researchers with quality-controlled bacterial and archaeal small subunit rRNA alignments and analysis tools. An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences. Substantial new analysis features include a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data. This pipeline offers a collection of tools that automate the data processing and simplify the computationally intensive analysis of large sequencing libraries. In addition, a new Taxomatic visualization tool allows rapid visualization of taxonomic inconsistencies and suggests corrections, and a new class Assignment Generator provides instructors with a lesson plan and individualized teaching materials. Details about RDP data and analytical functions can be found at http://rdp.cme.msu.edu/.

4,616 citations


Journal ArticleDOI
TL;DR: A number of new features in HPRD are added, including PhosphoMotif Finder, which allows users to find the presence of over 320 experimentally verified phosphorylation motifs in proteins of interest, and a protein distributed annotation system—Human Proteinpedia.
Abstract: Human Protein Reference Database (HPRD--http://www.hprd.org/), initially described in 2003, is a database of curated proteomic information pertaining to human proteins. We have recently added a number of new features in HPRD. These include PhosphoMotif Finder, which allows users to find the presence of over 320 experimentally verified phosphorylation motifs in proteins of interest. Another new feature is a protein distributed annotation system--Human Proteinpedia (http://www.humanproteinpedia.org/)--through which laboratories can submit their data, which is mapped onto protein entries in HPRD. Over 75 laboratories involved in proteomics research have already participated in this effort by submitting data for over 15,000 human proteins. The submitted data includes mass spectrometry and protein microarray-derived data, among other data types. Finally, HPRD is also linked to a compendium of human signaling pathways developed by our group, NetPath (http://www.netpath.org/), which currently contains annotations for several cancer and immune signaling pathways. Since the last update, more than 5500 new protein sequences have been added, making HPRD a comprehensive resource for studying the human proteome.

3,081 citations


Journal ArticleDOI
TL;DR: This article showed that baseline estimation errors are directly reflected in the observed PCR efficiency values and are thus propagated exponentially in the estimated starting concentrations as well as 'fold-difference' results.
Abstract: Despite the central role of quantitative PCR (qPCR) in the quantification of mRNA transcripts, most analyses of qPCR data are still delegated to the software that comes with the qPCR apparatus This is especially true for the handling of the fluorescence baseline This article shows that baseline estimation errors are directly reflected in the observed PCR efficiency values and are thus propagated exponentially in the estimated starting concentrations as well as 'fold-difference' results Because of the unknown origin and kinetics of the baseline fluorescence, the fluorescence values monitored in the initial cycles of the PCR reaction cannot be used to estimate a useful baseline value An algorithm that estimates the baseline by reconstructing the log-linear phase downward from the early plateau phase of the PCR reaction was developed and shown to lead to very reproducible PCR efficiency values PCR efficiency values were determined per sample by fitting a regression line to a subset of data points in the log-linear phase The variability, as well as the bias, in qPCR results was significantly reduced when the mean of these PCR efficiencies per amplicon was used in the calculation of an estimate of the starting concentration per sample

2,652 citations


Journal ArticleDOI
TL;DR: The utility of ToppGene Suite is demonstrated using 20 recently reported GWAS-based gene–disease associations (including novel disease genes) representing five diseases.
Abstract: ToppGene Suite (http://toppgene.cchmc.org; this web site is free and open to all users and does not require a login to access) is a one-stop portal for (i) gene list functional enrichment, (ii) candidate gene prioritization using either functional annotations or network analysis and (iii) identification and prioritization of novel disease candidate genes in the interactome. Functional annotation-based disease candidate gene prioritization uses a fuzzy-based similarity measure to compute the similarity between any two genes based on semantic annotations. The similarity scores from individual features are combined into an overall score using statistical meta-analysis. A P-value of each annotation of a test gene is derived by random sampling of the whole genome. The protein-protein interaction network (PPIN)-based disease candidate gene prioritization uses social and Web networks analysis algorithms (extended versions of the PageRank and HITS algorithms, and the K-Step Markov method). We demonstrate the utility of ToppGene Suite using 20 recently reported GWAS-based gene-disease associations (including novel disease genes) representing five diseases. ToppGene ranked 19 of 20 (95%) candidate genes within the top 20%, while ToppNet ranked 12 of 16 (75%) candidate genes among the top 20%.

2,435 citations


Journal ArticleDOI
TL;DR: The most important new developments in STRING 8 over previous releases include a URL-based programming interface, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures.
Abstract: Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a metadatabase that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.

2,394 citations


Journal ArticleDOI
TL;DR: Human Splicing Finder is designed, a tool to predict the effects of mutations on splicing signals or to identify splicing motifs in any human sequence, and it is shown that the mutation effect was correctly predicted in almost all cases.
Abstract: Thousands of mutations are identified yearly. Although many directly affect protein expression, an increasing proportion of mutations is now believed to influence mRNA splicing. They mostly affect existing splice sites, but synonymous, non-synonymous or nonsense mutations can also create or disrupt splice sites or auxiliary cis-splicing sequences. To facilitate the analysis of the different mutations, we designed Human Splicing Finder (HSF), a tool to predict the effects of mutations on splicing signals or to identify splicing motifs in any human sequence. It contains all available matrices for auxiliary sequence prediction as well as new ones for binding sites of the 9G8 and Tra2-beta Serine-Arginine proteins and the hnRNP A1 ribonucleoprotein. We also developed new Position Weight Matrices to assess the strength of 5' and 3' splice sites and branch points. We evaluated HSF efficiency using a set of 83 intronic and 35 exonic mutations known to result in splicing defects. We showed that the mutation effect was correctly predicted in almost all cases. HSF could thus represent a valuable resource for research, diagnostic and therapeutic (e.g. therapeutic exon skipping) purposes as well as for global studies, such as the GEN2PHEN European Project or the Human Variome Project.

2,300 citations


Journal ArticleDOI
TL;DR: The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated 3D protein models generated by automated homology modelling for all sequences in Swiss-Prot and for relevant models organisms.
Abstract: SWISS-MODEL Repository (http://swissmodel.expasy.org/repository/) is a database of 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated 3D protein models generated by automated homology modelling for all sequences in Swiss-Prot and for relevant models organisms. Regular updates ensure that target coverage is complete, that models are built using the most recent sequence and template structure databases, and that improvements in the underlying modelling pipeline are fully utilised. As of September 2008, the database contains 3.4 million entries for 2.7 million different protein sequences from the UniProt database. SWISS-MODEL Repository allows the users to assess the quality of the models in the database, search for alternative template structures, and to build models interactively via SWISS-MODEL Workspace (http://swissmodel.expasy.org/workspace/). Annotation of models with functional information and cross-linking with other databases such as the Protein Model Portal (http://www.proteinmodelportal.org) of the PSI Structural Genomics Knowledge Base facilitates the navigation between protein sequence and structure resources.

1,942 citations


Journal ArticleDOI
TL;DR: The InterPro database integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs.
Abstract: The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or 'signatures' representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total approximately 58,000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein-protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).

Journal ArticleDOI
TL;DR: The most recent release of HMDB has been significantly expanded and enhanced over the previous release, with the number of fully annotated metabolite entries growing from 2180 to more than 6800, a 300% increase.
Abstract: The Human Metabolome Database (HMDB, http://www.hmdb.ca) is a richly annotated resource that is designed to address the broad needs of biochemists, clinical chemists, physicians, medical geneticists, nutritionists and members of the metabolomics community. Since its first release in 2007, the HMDB has been used to facilitate the research for nearly 100 published studies in metabolomics, clinical biochemistry and systems biology. The most recent release of HMDB (version 2.0) has been significantly expanded and enhanced over the previous release (version 1.0). In particular, the number of fully annotated metabolite entries has grown from 2180 to more than 6800 (a 300% increase), while the number of metabolites with biofluid or tissue concentration data has grown by a factor of five (from 883 to 4413). Similarly, the number of purified compounds with reference to NMR, LC-MS and GC-MS spectra has more than doubled (from 380 to more than 790 compounds). In addition to this significant expansion in database size, many new database searching tools and new data content has been added or enhanced. These include better algorithms for spectral searching and matching, more powerful chemical substructure searches, faster text searching software, as well as dedicated pathway searching tools and customized, clickable metabolic maps. Changes to the user-interface have also been implemented to accommodate future expansion and to make database navigation much easier. These improvements should make the HMDB much more useful to a much wider community of users.

Journal ArticleDOI
TL;DR: A freely accessible, easy-to-use web server for metabolomic data analysis called MetaboAnalyst, which supports such techniques as: fold change analysis, t-tests, PCA, PLS-DA, hierarchical clustering and a number of more sophisticated statistical or machine learning methods.
Abstract: Metabolomics is a newly emerging field of 'omics' research that is concerned with characterizing large numbers of metabolites using NMR, chromatography and mass spectrometry. It is frequently used in biomarker identification and the metabolic profiling of cells, tissues or organisms. The data processing challenges in metabolomics are quite unique and often require specialized (or expensive) data analysis software and a detailed knowledge of cheminformatics, bioinformatics and statistics. In an effort to simplify metabolomic data analysis while at the same time improving user accessibility, we have developed a freely accessible, easy-to-use web server for metabolomic data analysis called MetaboAnalyst. Fundamentally, MetaboAnalyst is a web-based metabolomic data processing tool not unlike many of today's web-based microarray analysis packages. It accepts a variety of input data (NMR peak lists, binned spectra, MS peak lists, compound/concentration data) in a wide variety of formats. It also offers a number of options for metabolomic data processing, data normalization, multivariate statistical analysis, graphing, metabolite identification and pathway mapping. In particular, MetaboAnalyst supports such techniques as: fold change analysis, t-tests, PCA, PLS-DA, hierarchical clustering and a number of more sophisticated statistical or machine learning methods. It also employs a large library of reference spectra to facilitate compound identification from most kinds of input spectra. MetaboAnalyst guides users through a step-by-step analysis pipeline using a variety of menus, information hyperlinks and check boxes. Upon completion, the server generates a detailed report describing each method used, embedded with graphical and tabular outputs. MetaboAnalyst is capable of handling most kinds of metabolomic data and was designed to perform most of the common kinds of metabolomic data analyses. MetaboAnalyst is accessible at http://www.metaboanalyst.ca.

Journal ArticleDOI
TL;DR: The Pathway Interaction Database (PID), a freely available collection of curated and peer-reviewed pathways composed of human molecular signaling and regulatory events and key cellular processes, serves as a research tool for the cancer research community and others interested in cellular pathways.
Abstract: The Pathway Interaction Database (PID, http://pid.nci.nih.gov) is a freely available collection of curated and peer-reviewed pathways composed of human molecular signaling and regulatory events and key cellular processes. Created in a collaboration between the US National Cancer Institute and Nature Publishing Group, the database serves as a research tool for the cancer research community and others interested in cellular pathways, such as neuroscientists, developmental biologists and immunologists. PID offers a range of search features to facilitate pathway exploration. Users can browse the predefined set of pathways or create interaction network maps centered on a single molecule or cellular process of interest. In addition, the batch query tool allows users to upload long list(s) of molecules, such as those derived from microarray experiments, and either overlay these molecules onto predefined pathways or visualize the complete molecular connectivity map. Users can also download molecule lists, citation lists and complete database content in extensible markup language (XML) and Biological Pathways Exchange (BioPAX) Level 2 format. The database is updated with new pathway content every month and supplemented by specially commissioned articles on the practical uses of other relevant online tools.

Journal ArticleDOI
TL;DR: Here the authors review studies that provided important information about conformational properties of DNA using circular dichroic (CD) spectroscopy, which significantly participated in all basic conformational findings on DNA.
Abstract: Here we review studies that provided important information about conformational properties of DNA using circular dichroic (CD) spectroscopy. The conformational properties include the B-family of structures, A-form, Z-form, guanine quadruplexes, cytosine quadruplexes, triplexes and other less characterized structures. CD spectroscopy is extremely sensitive and relatively inexpensive. This fast and simple method can be used at low- as well as high-DNA concentrations and with short- as well as long-DNA molecules. The samples can easily be titrated with various agents to cause conformational isomerizations of DNA. The course of detected CD spectral changes makes possible to distinguish between gradual changes within a single DNA conformation and cooperative isomerizations between discrete structural states. It enables measuring kinetics of the appearance of particular conformers and determination of their thermodynamic parameters. In careful hands, CD spectroscopy is a valuable tool for mapping conformational properties of particular DNA molecules. Due to its numerous advantages, CD spectroscopy significantly participated in all basic conformational findings on DNA.

Journal ArticleDOI
TL;DR: The miRecords as mentioned in this paper database contains 1135 records of validated miRNA-target interactions between 301 miRNAs and 902 target genes in seven animal species in seven species.
Abstract: MicroRNAs (miRNAs) are an important class of small noncoding RNAs capable of regulating other genes’ expression. Much progress has been made in computational target prediction of miRNAs in recent years. More than 10 miRNA target prediction programs have been established, yet, the prediction of animal miRNA targets remains a challenging task. We have developed miRecords, an integrated resource for animal miRNA–target interactions. The Validated Targets component of this resource hosts a large, high-quality manually curated database of experimentally validated miRNA–target interactions with systematic documentation of experimental support for each interaction. The current release of this database includes 1135 records of validated miRNA–target interactions between 301 miRNAs and 902 target genes in seven animal species. The Predicted Targets component of miRecords stores predicted miRNA targets produced by 11 established miRNA target prediction programs. miRecords is expected to serve as a useful resource not only for experimental miRNA researchers, but also for informatics scientists developing the next-generation miRNA target prediction programs. The miRecords is available at http:// miRecords.umn.edu/miRecords.

Journal ArticleDOI
TL;DR: By attaching a rolling-circle substrate to a TIRF microscope-mounted flow chamber, this method allows for rapid and precise characterization of the kinetics of DNA synthesis and the effects of replication inhibitors.
Abstract: We present a simple technique for visualizing replication of individual DNA molecules in real time. By attaching a rolling-circle substrate to a TIRF microscope-mounted flow chamber, we are able to monitor the progression of single-DNA synthesis events and accurately measure rates and processivities of single T7 and Escherichia coli replisomes as they replicate DNA. This method allows for rapid and precise characterization of the kinetics of DNA synthesis and the effects of replication inhibitors.

Journal ArticleDOI
TL;DR: ‘miR2Disease’, a manually curated database, aims at providing a comprehensive resource of microRNA deregulation in various human diseases by reviewing more than 600 published papers.
Abstract: 'miR2Disease', a manually curated database, aims at providing a comprehensive resource of microRNA deregulation in various human diseases. The current version of miR2Disease documents 1939 curated relationships between 299 human microRNAs and 94 human diseases by reviewing more than 600 published papers. Around one-seventh of the microRNA-disease relationships represent the pathogenic roles of deregulated microRNA in human disease. Each entry in the miR2Disease contains detailed information on a microRNA-disease relationship, including a microRNA ID, the disease name, a brief description of the microRNA-disease relationship, an expression pattern of the microRNA, the detection method for microRNA expression, experimentally verified target gene(s) of the microRNA and a literature reference. miR2Disease provides a user-friendly interface for a convenient retrieval of each entry by microRNA ID, disease name, or target gene. In addition, miR2Disease offers a submission page that allows researchers to submit established microRNA-disease relationships that are not documented. Once approved by the submission review committee, the submitted records will be included in the database. miR2Disease is freely available at http://www.miR2Disease.org.

Journal ArticleDOI
TL;DR: This work presents the first multiplexed QPCR method for telomere length measurement, and shows that a single fluorescent DNA-intercalating dye is sufficient in this system, because T signals can be collected in early cycles, before S signals rise above baseline, and S signals can been collected at a temperature that fully melts the telomeres product, sending its signal to baseline.
Abstract: The current quantitative polymerase chain reaction (QPCR) assay of telomere length measures telomere (T) signals in experimental DNA samples in one set of reaction wells, and single copy gene (S) signals in separate wells, in comparison to a reference DNA, to yield relative T/S ratios that are proportional to average telomere length. Multiplexing this assay is desirable, because variation in the amount of DNA pipetted would no longer contribute to variation in T/S, since T and S would be collected within each reaction, from the same input DNA. Multiplexing also increases throughput and lowers costs, since half as many reactions are needed. Here, we present the first multiplexed QPCR method for telomere length measurement. Remarkably, a single fluorescent DNA-intercalating dye is sufficient in this system, because T signals can be collected in early cycles, before S signals rise above baseline, and S signals can be collected at a temperature that fully melts the telomere product, sending its signal to baseline. The correlation of T/S ratios with Terminal Restriction Fragment (TRF) lengths measured by Southern blot was stronger with this monochrome multiplex QPCR method (R(2) = 0.844) than with our original singleplex method (R(2) = 0.677). Multiplex T/S results from independent runs on different days were highly reproducible (R(2) = 0.91).

Journal ArticleDOI
TL;DR: A set of web servers are presented to facilitate and optimize the utility of biological activity information within PubChem and provide tools for rapid data retrieval, integration and comparison of biological screening results, exploratory structure–activity analysis, and target selectivity examination.
Abstract: PubChem (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological properties of small molecules hosted by the US National Institutes of Health (NIH). PubChem BioAssay database currently contains biological test results for more than 700 000 compounds. The goal of PubChem is to make this information easily accessible to biomedical researchers. In this work, we present a set of web servers to facilitate and optimize the utility of biological activity information within PubChem. These web-based services provide tools for rapid data retrieval, integration and comparison of biological screening results, exploratory structure-activity analysis, and target selectivity examination. This article reviews these bioactivity analysis tools and discusses their uses. Most of the tools described in this work can be directly accessed at http://pubchem.ncbi.nlm.nih.gov/assay/. URLs for accessing other tools described in this work are specified individually.

Journal ArticleDOI
TL;DR: NCBI's Conserved Domain Database is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution, and provides annotation of domain footprints and conserved functional sites on protein sequences.
Abstract: NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution The collection can be accessed at http://wwwncbinlmnihgov/Structure/cdd/cddshtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources CDD provides annotation of domain footprints and conserved functional sites on protein sequences Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://wwwncbinlmnihgov/Structure/cdd/wrpsbcgi Starting with the latest version of CDD, v214, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either ‘specific’ (identifying molecular function with high confidence) or as ‘non-specific’ (identifying superfamily membership only)

Journal ArticleDOI
TL;DR: The Antibiotic Resistance Genes Database (ARDB) is a manually curated database unifying most of the publicly available information on antibiotic resistance and can be used as compendium of antibiotic resistance factors as well as to identify the resistance genes of newly sequenced genes, genomes, or metagenomes.
Abstract: The treatment of infections is increasingly compromised by the ability of bacteria to develop resistance to antibiotics through mutations or through the acquisition of resistance genes. Antibiotic resistance genes also have the potential to be used for bio-terror purposes through genetically modified organisms. In order to facilitate the identification and characterization of these genes, we have created a manually curated database—the Antibiotic Resistance Genes Database (ARDB)—unifying most of the publicly available information on antibiotic resistance. Each gene and resistance type is annotated with rich information, including resistance profile, mechanism of action, ontology, COG and CDD annotations, as well as external links to sequence and protein databases. Our database also supports sequence similarity searches and implements an initial version of a tool for characterizing common mutations that confer antibiotic resistance. The information we provide can be used as compendium of antibiotic resistance factors as well as to identify the resistance genes of newly sequenced genes, genomes, or metagenomes. Currently, ARDB contains resistance information for 13 293 genes, 377 types, 257 antibiotics, 632 genomes, 933 species and 124 genera. ARDB is available at http://ardb.cbcb.umd.edu/.

Journal ArticleDOI
TL;DR: Cadnano as mentioned in this paper is an open-source software package with a graphical user interface that aids in the design of DNA sequences for folding 3D honeycomb-pleated shapes A series of rectangular-block motifs were designed, assembled, and analyzed to identify a well-behaved motif that could serve as a building block for future studies.
Abstract: DNA nanotechnology exploits the programmable specificity afforded by base-pairing to produce self-assembling macromolecular objects of custom shape. For building megadalton-scale DNA nanostructures, a long 'scaffold' strand can be employed to template the assembly of hundreds of oligonucleotide 'staple' strands into a planar antiparallel array of cross-linked helices. We recently adapted this 'scaffolded DNA origami' method to producing 3D shapes formed as pleated layers of double helices constrained to a honeycomb lattice. However, completing the required design steps can be cumbersome and time-consuming. Here we present caDNAno, an open-source software package with a graphical user interface that aids in the design of DNA sequences for folding 3D honeycomb-pleated shapes A series of rectangular-block motifs were designed, assembled, and analyzed to identify a well-behaved motif that could serve as a building block for future studies. The use of caDNAno significantly reduces the effort required to design 3D DNA-origami structures. The software is available at http://cadnano.org/, along with example designs and video tutorials demonstrating their construction. The source code is released under the MIT license.

Journal ArticleDOI
TL;DR: The Gene Expression Omnibus at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data and offers many tools and features that allow users to effectively explore, analyze and download expression data from both gene-centric and experiment-centric perspectives.
Abstract: The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data. Additionally, GEO hosts other categories of high-throughput functional genomic data, including those that examine genome copy number variations, chromatin structure, methylation status and transcription factor binding. These data are generated by the research community using high-throughput technologies like microarrays and, more recently, next-generation sequencing. The database has a flexible infrastructure that can capture fully annotated raw and processed data, enabling compliance with major community-derived scientific reporting standards such as ‘Minimum Information About a Microarray Experiment’ (MIAME). In addition to serving as a centralized data storage hub, GEO offers many tools and features that allow users to effectively explore, analyze and download expression data from both gene-centric and experiment-centric perspectives. This article summarizes the GEO repository structure, content and operating procedures, as well as recently introduced data mining features. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.

Journal ArticleDOI
TL;DR: It is 10 years since the IMGT/HLA database was released, providing the HLA community with a searchable repository of highly curated HLA sequences, and regular updates to the website ensure that new and confirmatory sequences are dispersed to the Hla community, and the wider research and clinical communities.
Abstract: It is 10 years since the IMGT/HLA database was released, providing the HLA community with a searchable repository of highly curated HLA sequences. The HLA complex is located within the 6p21.3 region of human chromosome 6 and contains more than 220 genes of diverse function. Many of the genes encode proteins of the immune system and are highly polymorphic. The naming of these HLA genes and alleles, and their quality control is the responsibility of the WHO Nomenclature Committee for Factors of the HLA System. Through the work of the HLA Informatics Group and in collaboration with the European Bioinformatics Institute, we are able to provide public access to this data through the website http://www.ebi.ac.uk/imgt/hla/. The first release contained 964 sequences, the most recent release 3300 sequences, with around 450 new sequences been added each year. The tools provided on the website have been updated to allow more complex alignments, which include genomic sequence data, as well as the development of tools for probe and primer design and the inclusion of data from the HLA Dictionary. Regular updates to the website ensure that new and confirmatory sequences are dispersed to the HLA community, and the wider research and clinical communities.

Journal ArticleDOI
TL;DR: The domain annotations were extended with data on metabolic pathways and links to several pathways resources, and the interaction network view was completely redesigned and is now available for more than 2 million proteins.
Abstract: Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improving user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathways resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using distributed annotation system (DAS) or through a simple object access protocol (SOAP) based web service.

Journal ArticleDOI
TL;DR: PlasmoDB as mentioned in this paper is a functional genomic database for Plasmodium spp. that provides a resource for data analysis and visualization in a gene-by-gene or genome-wide scale.
Abstract: PlasmoDB (http://PlasmoDB.org) is a functional genomic database for Plasmodium spp. that provides a resource for data analysis and visualization in a gene-by-gene or genome-wide scale. PlasmoDB belongs to a family of genomic resources that are housed under the EuPathDB (http://EuPathDB.org) Bioinformatics Resource Center (BRC) umbrella. The latest release, PlasmoDB 5.5, contains numerous new data types from several broad categories--annotated genomes, evidence of transcription, proteomics evidence, protein function evidence, population biology and evolution. Data in PlasmoDB can be queried by selecting the data of interest from a query grid or drop down menus. Various results can then be combined with each other on the query history page. Search results can be downloaded with associated functional data and registered users can store their query history for future retrieval or analysis.

Journal ArticleDOI
TL;DR: Improved orthology prediction methods allowing pathway inference for 22 species and through collaborations to create manually curated Reactome pathway datasets for species including Arabidopsis, Oryza sativa, Drosophila and Gallus gallus.
Abstract: Reactome (http://www.reactome.org) is an expert-authored, peer-reviewed knowledgebase of human reactions and pathways that functions as a data mining resource and electronic textbook. Its current release includes 2975 human proteins, 2907 reactions and 4455 literature citations. A new entity-level pathway viewer and improved search and data mining tools facilitate searching and visualizing pathway data and the analysis of user-supplied high-throughput data sets. Reactome has increased its utility to the model organism communities with improved orthology prediction methods allowing pathway inference for 22 species and through collaborations to create manually curated Reactome pathway datasets for species including Arabidopsis, Oryza sativa (rice), Drosophila and Gallus gallus (chicken). Reactome's data content and software can all be freely used and redistributed under open source terms.

Journal ArticleDOI
TL;DR: Using frequently occurring residues, database-aided peptide design in different ways is demonstrated, and GLK-19 showed a higher activity against Escherichia coli than human LL-37 and Leu, Ala, Gly and Lys in amphibian peptides.
Abstract: The antimicrobial peptide database (APD, http://aps.unmc.edu/AP/main.php) has been updated and expanded. It now hosts 1228 entries with 65 anticancer, 76 antiviral (53 anti-HIV), 327 antifungal and 944 antibacterial peptides. The second version of our database (APD2) allows users to search peptide families (e.g. bacteriocins, cyclotides, or defensins), peptide sources (e.g. fish, frogs or chicken), post-translationally modified peptides (e.g. amidation, oxidation, lipidation, glycosylation or d-amino acids), and peptide binding targets (e.g. membranes, proteins, DNA/RNA, LPS or sugars). Statistical analyses reveal that the frequently used amino acid residues (>10%) are Ala and Gly in bacterial peptides, Cys and Gly in plant peptides, Ala, Gly and Lys in insect peptides, and Leu, Ala, Gly and Lys in amphibian peptides. Using frequently occurring residues, we demonstrate database-aided peptide design in different ways. Among the three peptides designed, GLK-19 showed a higher activity against Escherichia coli than human LL-37.

Journal ArticleDOI
TL;DR: A comprehensive update of the database contents and features and summarize recent discoveries recorded in TCDB are presented.
Abstract: The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.