
Showing papers by "David S. Wishart published in 2006"


Journal ArticleDOI
TL;DR: DrugBank is a unique bioinformatics/cheminformatics resource that combines detailed drug data with comprehensive drug target information and is fully searchable supporting extensive text, sequence, chemical structure and relational query searches.
Abstract: DrugBank is a unique bioinformatics/cheminformatics resource that combines detailed drug (i.e. chemical) data with comprehensive drug target (i.e. protein) information. The database contains >4100 drug entries including >800 FDA approved small molecule and biotech drugs as well as >3200 experimental drugs. Additionally, >14 000 protein or drug target sequences are linked to these drug entries. Each DrugCard entry contains >80 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. Many data fields are hyperlinked to other databases (KEGG, PubChem, ChEBI, PDB, Swiss-Prot and GenBank) and a variety of structure viewing applets. The database is fully searchable supporting extensive text, sequence, chemical structure and relational query searches. Potential applications of DrugBank include in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. DrugBank is available at http://redpoll.pharmacy.ualberta.ca/drugbank/.
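The field-structured, text-searchable DrugCard model described above can be sketched as a small data structure with a naive full-text search. The field names and example entries below are hypothetical simplifications for illustration, not DrugBank's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DrugCard:
    """Minimal stand-in for a DrugBank DrugCard entry (hypothetical field names)."""
    drugbank_id: str
    name: str
    drug_type: str                                        # e.g. "small molecule" or "biotech"
    groups: list = field(default_factory=list)            # e.g. ["approved", "experimental"]
    target_sequences: list = field(default_factory=list)  # linked drug target sequences

def text_search(cards, query):
    """Naive text search across a few fields, mimicking a simple text/relational query."""
    q = query.lower()
    return [c for c in cards
            if q in c.name.lower()
            or q in c.drug_type.lower()
            or any(q in g.lower() for g in c.groups)]

cards = [
    DrugCard("DB00001", "Lepirudin", "biotech", ["approved"]),
    DrugCard("DB01050", "Ibuprofen", "small molecule", ["approved"]),
]
hits = text_search(cards, "small molecule")  # matches only the Ibuprofen card
```

A real query layer would of course index the ~80 fields rather than scanning them linearly; the sketch only shows the shape of the record and the search.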

3,087 citations


Journal ArticleDOI
TL;DR: A broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis is conducted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on "older" technologies.
Abstract: Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allows computers to “learn” from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on “older” technologies.
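The "learn from past examples" idea the review surveys can be shown with a deliberately tiny classifier. The nearest-centroid method and the synthetic two-biomarker data below are illustrative choices of my own, not methods or data taken from the review.

```python
import math

def nearest_centroid_fit(X, y):
    """Compute a per-class mean ("centroid") over biomarker vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign the class whose centroid is closest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda lab: dist(centroids[lab], x))

# Synthetic two-biomarker data: "relapse" cases have elevated marker levels.
train_X = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["no-relapse", "no-relapse", "relapse", "relapse"]
model = nearest_centroid_fit(train_X, train_y)
pred = nearest_centroid_predict(model, [3.1, 3.0])  # elevated markers -> "relapse"
```

Real cancer prognosis models are trained on thousands of noisy features, which is exactly why the more elaborate methods the review compares (SVMs, neural networks, decision trees) are needed; the sketch only fixes the train/predict vocabulary.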

967 citations


Journal ArticleDOI
TL;DR: A snapshot analysis based on the most recent genome sequences of two E.coli K-12 strains allows comparison of their genotypes and mutant status of alleles.
Abstract: The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product on the basis of experimental evidence or sequence analysis. Since both kinds of evidence are constantly expanding, no annotation is complete at any moment in time. This is a snapshot analysis based on the most recent genome sequences of two E. coli K-12 strains. An accurate and up-to-date description of E. coli K-12 genes is of particular importance to the scientific community because experimentally determined properties of its gene products provide fundamental information for annotation of innumerable genes of other organisms. Availability of the complete genome sequence of two K-12 strains allows comparison of their genotypes and mutant status of alleles.

636 citations


Journal ArticleDOI
TL;DR: A web server, called PREDITOR, which greatly accelerates and simplifies the determination of torsion angle restraints including phi, psi, omega and chi angles and is 35 times faster and up to 20% more accurate than any existing method.
Abstract: Every year between 500 and 1000 peptide and protein structures are determined by NMR and deposited into the Protein Data Bank. However, the process of NMR structure determination continues to be a manually intensive and time-consuming task. One of the most tedious and error-prone aspects of this process involves the determination of torsion angle restraints including phi, psi, omega and chi angles. Most methods require many days of additional experiments, painstaking measurements or complex calculations. Here we wish to describe a web server, called PREDITOR, which greatly accelerates and simplifies this task. PREDITOR accepts sequence and/or chemical shift data as input and generates torsion angle predictions (with predicted errors) for phi, psi, omega and chi-1 angles. PREDITOR combines sequence alignment methods with advanced chemical shift analysis techniques to generate its torsion angle predictions. The method is fast (<40 s per protein) and accurate, with 88% of phi/psi predictions being within 30 degrees of the correct values, 84% of chi-1 predictions being correct and 99.97% of omega angles being correct. PREDITOR is 35 times faster and up to 20% more accurate than any existing method. PREDITOR also provides accurate assessments of the torsion angle errors so that the torsion angle constraints can be readily fed into standard structure refinement programs, such as CNS, XPLOR, AMBER and CYANA. Other unique features of PREDITOR include dihedral angle prediction via PDB structure mapping, automated chemical shift re-referencing (to improve accuracy), prediction of proline cis/trans states and a simple user interface. The PREDITOR website is located at: http://wishart.biology.ualberta.ca/preditor.
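The accuracy figures above (e.g. 88% of phi/psi predictions within 30 degrees) imply an angle comparison that handles circular wraparound, since phi and psi live on a ±180° scale. A minimal sketch of that scoring, with invented example values:

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (handles wraparound,
    so 175 vs -175 is 10 degrees apart, not 350)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def within_tolerance(predicted, observed, tol=30.0):
    return angle_diff(predicted, observed) <= tol

def phi_psi_accuracy(pred_pairs, obs_pairs, tol=30.0):
    """Fraction of residues whose predicted phi AND psi are both within tol degrees."""
    ok = sum(1 for (pp, ps), (op, os) in zip(pred_pairs, obs_pairs)
             if within_tolerance(pp, op, tol) and within_tolerance(ps, os, tol))
    return ok / len(pred_pairs)
```

Nothing here is PREDITOR's algorithm; it is only the evaluation arithmetic behind "within 30 degrees of the correct values".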

187 citations


Journal ArticleDOI
TL;DR: This work has developed a method that performs structure-based sequence alignments as part of the secondary structure prediction process that is approximately 4–5% better than any other method currently available.
Abstract: The accuracy of protein secondary structure prediction has steadily improved over the past 30 years. Now many secondary structure prediction methods routinely achieve an accuracy (Q3) of about 75%. We believe this accuracy could be further improved by including structure (as opposed to sequence) database comparisons as part of the prediction process. Indeed, given the large size of the Protein Data Bank (>35,000 sequences), the probability of a newly identified sequence having a structural homologue is actually quite high. We have developed a method that performs structure-based sequence alignments as part of the secondary structure prediction process. By mapping the structure of a known homologue (sequence ID >25%) onto the query protein's sequence, it is possible to predict at least a portion of that query protein's secondary structure. By integrating this structural alignment approach with conventional (sequence-based) secondary structure methods and then combining it with a "jury-of-experts" system to generate a consensus result, it is possible to attain very high prediction accuracy. Using a sequence-unique test set of 1644 proteins from EVA, this new method achieves an average Q3 score of 81.3%. Extensive testing indicates this is approximately 4–5% better than any other method currently available. Assessments using non-sequence-unique test sets (typical of those used in proteome annotation or structural genomics) indicate that this new method can achieve a Q3 score approaching 88%. By using both sequence and structure databases and by exploiting the latest techniques in machine learning it is possible to routinely predict protein secondary structure with an accuracy well above 80%. A program and web server, called PROTEUS, that performs these secondary structure predictions is accessible at http://wishart.biology.ualberta.ca/proteus.
For high throughput or batch sequence analyses, the PROTEUS programs, databases (and server) can be downloaded and run locally.
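The "jury-of-experts" consensus and the Q3 score can both be sketched directly: a per-residue majority vote over several predictors' helix/strand/coil (H/E/C) strings, scored against the observed states. The three-expert example strings below are invented.

```python
from collections import Counter

def consensus(predictions):
    """Per-residue majority vote over several predictors' H/E/C strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*predictions))

def q3(predicted, observed):
    """Q3: percentage of residues assigned the correct H/E/C state."""
    same = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * same / len(observed)

# Three hypothetical experts disagree on one residue; the vote resolves it.
experts = ["HHHHCC", "HHHECC", "HHHHCC"]
cons = consensus(experts)
score = q3(cons, "HHHHCC")
```

PROTEUS's actual jury weighs its experts rather than counting them equally, so treat the plain vote as the simplest possible instance of the idea.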

146 citations


Proceedings Article
16 Jul 2006
TL;DR: A framework, ExplainD, is described for explaining decisions made by classifiers that use additive evidence, which applies to many widely used classifiers, including linear discriminants and many additive models.
Abstract: Machine-learned classifiers are important components of many data mining and knowledge discovery systems. In several application domains, an explanation of the classifier's reasoning is critical for the classifier's acceptance by the end-user. We describe a framework, ExplainD, for explaining decisions made by classifiers that use additive evidence. ExplainD applies to many widely used classifiers, including linear discriminants and many additive models. We demonstrate our ExplainD framework using implementations of naive Bayes, linear support vector machine, and logistic regression classifiers on example applications. ExplainD uses a simple graphical explanation of the classification process to provide visualizations of the classifier decisions, visualization of the evidence for those decisions, the capability to speculate on the effect of changes to the data, and the capability, wherever possible, to drill down and audit the source of the evidence. We demonstrate the effectiveness of ExplainD in the context of a deployed web-based system (Proteome Analyst) and using a downloadable Python-based implementation.
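The additive-evidence idea ExplainD visualizes reduces to per-feature contributions w_i * x_i whose sum (plus a bias) is the linear decision value. A minimal sketch with hypothetical feature names and weights, not ExplainD's actual interface:

```python
def explain_additive(weights, x, bias=0.0):
    """Per-feature evidence contributions w_i * x_i for an additive classifier.

    For a linear model the decision value is bias + sum(contributions);
    positive contributions push toward the positive class, negative ones away.
    This per-feature breakdown is what makes the decision auditable.
    """
    contributions = {name: w * x[name] for name, w in weights.items()}
    decision = bias + sum(contributions.values())
    return contributions, decision

weights = {"marker_a": 1.5, "marker_b": -2.0}   # learned weights (hypothetical)
x = {"marker_a": 2.0, "marker_b": 0.5}          # one instance's feature values
contrib, score = explain_additive(weights, x, bias=0.1)
```

The same decomposition applies to naive Bayes and logistic regression once their scores are written in log-odds form, which is why the paper can cover all three with one framework.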

126 citations


Journal ArticleDOI
TL;DR: The key advantages of this protocol over existing methods for studying protein dynamics are that it does not require prior knowledge of a protein's tertiary structure, it is not sensitive to the protein's overall tumbling, and it does not require additional NMR measurements beyond the standard experiments for backbone assignments.
Abstract: We present a protocol for predicting protein flexibility from NMR chemical shifts. The protocol consists of (i) ensuring that the chemical shift assignments are correctly referenced or, if not, performing a reference correction using information derived from the chemical shift index, (ii) calculating the random coil index (RCI), and (iii) predicting the expected root mean square fluctuations (RMSFs) and order parameters (S2) of the protein from the RCI. The key advantages of this protocol over existing methods for studying protein dynamics are that (i) it does not require prior knowledge of a protein's tertiary structure, (ii) it is not sensitive to the protein's overall tumbling and (iii) it does not require additional NMR measurements beyond the standard experiments for backbone assignments. When chemical shift assignments are available, protein flexibility parameters, such as S2 and RMSF, can be calculated within 1–2 h using a spreadsheet program.
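The RCI step of the protocol can be sketched end to end: average the absolute secondary chemical shifts, invert, and map to an order parameter. The weights, the floor value, and the linear RCI-to-S2 mapping below are placeholder assumptions for illustration, not the paper's published calibration.

```python
def random_coil_index(secondary_shifts, weights=None, floor=0.5):
    """Illustrative RCI: inverse of a weighted average of absolute secondary shifts.

    secondary_shifts: dict nucleus -> |observed - random-coil| shift (ppm).
    Small secondary shifts mean random-coil-like (flexible) residues, so the
    inverse is large there. Weights and floor are placeholders, not the
    published calibration; the floor just avoids division blow-up.
    """
    weights = weights or {"CA": 1.0, "CO": 1.0, "CB": 1.0}
    total_w = sum(weights[n] for n in secondary_shifts)
    avg = sum(weights[n] * abs(s) for n, s in secondary_shifts.items()) / total_w
    return 1.0 / max(avg, floor)

def order_parameter(rci, scale=0.1):
    """Illustrative mapping from RCI to a model-free order parameter S2
    (assumed linear here; the real calibration is empirical)."""
    return max(0.0, 1.0 - scale * rci)

rigid = random_coil_index({"CA": 3.0, "CO": 2.0, "CB": 2.5})   # large shifts
floppy = random_coil_index({"CA": 0.2, "CO": 0.1, "CB": 0.1})  # near random coil
```

The point of the sketch is only the direction of the relationships: flexible residues sit near random-coil shifts, get a high RCI, and therefore a low predicted S2, all without any structure or relaxation data.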

79 citations


Journal ArticleDOI
TL;DR: The combined challenge of revising existing annotations and extracting useful information from the flood of new genome sequences will necessitate more reliance on completely automated systems.

66 citations


Journal ArticleDOI
TL;DR: The application of metabolomics to kidney transplant monitoring is still very much in its infancy, but there are a number of easily measured metabolites in both urine and serum that can provide reliable indications of organ function, organ injury, and immunosuppressive drug toxicity.
Abstract: Purpose of review: The success of any given kidney transplant is closely tied to the ability to monitor patients and responsively change their medications. Transplant monitoring is still, however, dependent on relatively old technologies: serum creatinine levels, urine output, blood pressure, blood gl…

56 citations


Journal ArticleDOI
TL;DR: A program, called SHIFTOR, that is able to accurately predict a large number of protein torsion angles using only 1H, 13C and 15N chemical shift assignments as input and its predictions are approximately 20% better than existing methods.
Abstract: Torsion angle restraints are frequently used in the determination and refinement of protein structures by NMR. These restraints may be obtained by J coupling, cross-correlation measurements, nuclear Overhauser effects (NOEs) or secondary chemical shifts. Currently most backbone (phi/psi) torsion angles are determined using a combination of J(HNHalpha) couplings and chemical shift measurements while most side-chain (chi1) angles and cis/trans peptide bond angles (omega) are determined via NOEs. The dependency on multiple experimental (and computational) methods to obtain different torsion angle restraints is both time-consuming and error prone. The situation could be greatly improved if the determination of all torsion angles (phi, psi, chi and omega) could be made via a single type of measurement (i.e. chemical shifts). Here we describe a program, called SHIFTOR, that is able to accurately predict a large number of protein torsion angles (phi, psi, omega, chi1) using only 1H, 13C and 15N chemical shift assignments as input. Overall, the program is 100x faster and its predictions are approximately 20% better than existing methods. The program is also capable of predicting chi1 angles with 81% accuracy and omega angles with 100% accuracy. SHIFTOR exploits many of the recent developments and observations regarding chemical shift dependencies as well as using information in the Protein Databank to improve the quality of its shift-derived torsion angle predictions. SHIFTOR is available as a freely accessible web server at http://wishart.biology.ualberta.ca/shiftor.
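The secondary chemical shifts SHIFTOR builds on are simply observed minus random-coil values, and their sign carries backbone information. The sketch below uses approximate textbook Cα random-coil shifts and a crude threshold rule that is far simpler than SHIFTOR's actual method; values and threshold are illustrative.

```python
# Approximate Calpha random-coil reference shifts (ppm) for two residue types --
# rough textbook values, shown only for illustration.
RANDOM_COIL_CA = {"A": 52.5, "G": 45.1}

def secondary_shift(residue, observed_ca):
    """Secondary shift = observed - random-coil value for that residue type."""
    return observed_ca - RANDOM_COIL_CA[residue]

def crude_backbone_guess(residue, observed_ca, threshold=0.7):
    """Toy phi/psi hint from one Calpha shift: positive secondary shifts lean
    helical, negative lean extended, small ones are called coil."""
    d = secondary_shift(residue, observed_ca)
    if d > threshold:
        return "helical"
    if d < -threshold:
        return "extended"
    return "coil"
```

SHIFTOR combines 1H, 13C and 15N shifts with database-derived dependencies rather than thresholding one nucleus, but this is the raw signal it starts from.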

48 citations


Proceedings ArticleDOI
01 Dec 2006
TL;DR: The developed BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules - both large and small, and is believed to be a particularly valuable tool for researchers in metabolomics.
Abstract: One of the growing challenges in life science research lies in finding useful, descriptive or quantitative data about newly reported biomolecules (genes, proteins, metabolites and drugs). An even greater challenge is finding information that connects these genes, proteins, drugs or metabolites to each other. Much of this information is scattered through hundreds of different databases, abstracts or books and almost none of it is particularly well integrated. While some efforts are being undertaken at the NCBI and EBI to integrate many different databases together, this still falls short of the goal of having some kind of human-readable synopsis that summarizes the state of knowledge about a given biomolecule - especially small molecules. To address this shortfall, we have developed BioSpider. BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules - both large and small. Specifically, BioSpider allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and it returns an in-depth synoptic report (approximately 3-30 pages in length) about that biomolecule and any other biomolecule it may target. This summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. Because of its breadth, depth and comprehensiveness, we believe BioSpider will prove to be a particularly valuable tool for researchers in metabolomics. BioSpider is available at: www.biospider.ca
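Accepting "almost any kind" of identifier implies a dispatch step that first guesses what kind of identifier the query is. The regex heuristics below are simplified assumptions for illustration, not BioSpider's real logic.

```python
import re

def classify_identifier(query):
    """Crude guess at the type of a BioSpider-style query string.

    Each pattern is a deliberately simplified heuristic: CAS numbers are
    digit groups joined by hyphens, InChI strings are self-labelling,
    long runs of amino-acid letters look like protein sequences, and
    letter-prefixed digit runs look like database accession numbers.
    """
    if re.fullmatch(r"\d{2,7}-\d{2}-\d", query):
        return "CAS number"
    if re.fullmatch(r"InChI=.+", query):
        return "InChI string"
    if re.fullmatch(r"[ACDEFGHIKLMNPQRSTVWY]{20,}", query):
        return "protein sequence"
    if re.fullmatch(r"[A-Z]{1,3}_?\d{5,}(\.\d+)?", query):
        return "accession number"
    return "name or SMILES"
```

A production crawler would then route each identifier type to the databases that accept it (PubChem for CAS/SMILES, GenBank for accessions, and so on) and merge the returned records into the synoptic report.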


Proceedings ArticleDOI
01 Dec 2006
TL;DR: This year's Pacific Symposium in Biocomputing solicited papers that focused specifically on describing novel methods for the acquisition, management and analysis of metabolomic data, particularly interested in papers that covered one of the five following topics: metabolomics databases; 2) metabolomics LIMS; 3) spectral analysis tools for metabolomics; 4) medical or applied metabolomics.
Abstract: 1. Session Background and Motivation This marks the first time that the Pacific Symposium in Biocomputing has hosted a session specifically devoted to the emerging computational needs of metabolomics. Metabolomics, or metabonomics as it is sometimes called, is a relatively new field of “omics” research concerned with the high-throughput identification and quantification of the small molecule metabolites in the metabolome (i.e. the complete complement of all small molecule metabolites found in a specific cell, organ or organism). It is a close counterpart to the genome, the transcriptome and the proteome. Together these four “omes” constitute the building blocks of systems biology. Even though metabolomics is primarily concerned with tracking and identifying chemicals as opposed to genes or proteins, it still shares many of the same computational needs with genomics, proteomics and transcriptomics. For instance, just like the other “omics” fields, metabolomics needs electronically accessible and searchable databases, it needs software to handle or process data from various high-throughput instruments such as NMR spectrometers or mass spectrometers, it needs laboratory information management systems (LIMS) to manage the data, and it needs software tools to predict or find information about metabolite properties, pathways, relationships or functions. These computational needs are just beginning to be addressed by members of the metabolomics community. As a result we believed that a PSB session devoted to this topic could address a number of important issues concerning both the emerging computational needs and the nascent computational trends in metabolomics. This year we solicited papers that focused specifically on describing novel methods for the acquisition, management and analysis of metabolomic data.
We were particularly interested in papers that covered one of the following five topics: 1) metabolomics databases; 2) metabolomics LIMS; 3) spectral analysis tools for metabolomics; 4) medical or applied metabolomics