Showing papers in "arXiv: Genomics in 2010"


Posted Content•
TL;DR: In this paper, a simple but powerful method, named "BOolean Operation based Screening and Testing" (BOOST), is introduced to discover unknown gene-gene interactions that underlie complex diseases.
Abstract: Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation based Screening and Testing" (BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4G memory running Windows XP. The interaction patterns identified from the type 1 diabetes data set display significant differences from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.

397 citations
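
As a rough illustration of the Boolean-operation idea named in the BOOST abstract above, the sketch below packs each SNP's genotype indicators into integer bitmasks, so that the genotype-by-genotype-by-status contingency table for one SNP pair can be filled with bitwise ANDs and bit counts. The encoding, the function names (`encode_snp`, `pair_table`), and the toy data are assumptions made for illustration; the abstract does not specify the authors' data layout or test statistic.

```python
def encode_snp(genotypes):
    """Return three bitmasks; bit i of mask g is set when sample i has genotype g (0, 1 or 2)."""
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def pair_table(snp_a, snp_b, is_case):
    """Count (cases, controls) for each joint genotype of two SNPs using AND + bit counting."""
    n = len(is_case)
    case_mask = sum(1 << i for i, c in enumerate(is_case) if c)
    control_mask = ((1 << n) - 1) ^ case_mask  # all samples that are not cases
    a, b = encode_snp(snp_a), encode_snp(snp_b)
    table = {}
    for ga in range(3):
        for gb in range(3):
            both = a[ga] & b[gb]  # samples carrying genotype ga at A and gb at B
            table[(ga, gb)] = (
                bin(both & case_mask).count("1"),
                bin(both & control_mask).count("1"),
            )
    return table


# Hypothetical toy data: six samples, two SNPs, 1 = case and 0 = control.
toy_table = pair_table([0, 1, 2, 1, 0, 2], [0, 0, 1, 1, 2, 2], [1, 0, 1, 1, 0, 0])
```

A screening step would then compute an interaction statistic from each table; which statistic to use is left open here because the abstract does not name one.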


Posted Content•
TL;DR: In this paper, a tree shape peak identification method for ChIP-seq experiments is proposed; it is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments.
Abstract: We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at this http URL

55 citations


Journal Article•DOI•
TL;DR: This model - which utilizes the rate-distortion theory of noisy communication channels - suggests that the genetic code originated as a result of the interplay of the three conflicting evolutionary forces: the needs for diverse amino-acids, for error-tolerance and for minimal cost of resources.
Abstract: The genetic code maps the sixty-four nucleotide triplets (codons) to twenty amino-acids. While the biochemical details of this code were unraveled long ago, its origin is still obscure. We review information-theoretic approaches to the problem of the code's origin and discuss the results of a recent work that treats the code in terms of an evolving, error-prone information channel. Our model - which utilizes the rate-distortion theory of noisy communication channels - suggests that the genetic code originated as a result of the interplay of the three conflicting evolutionary forces: the needs for diverse amino-acids, for error-tolerance and for minimal cost of resources. The description of the code as an information channel allows us to mathematically identify the fitness of the code and locate its emergence at a second-order phase transition when the mapping of codons to amino-acids becomes nonrandom. The noise in the channel brings about an error-graph, in which edges connect codons that are likely to be confused. The emergence of the code is governed by the topology of the error-graph, which determines the lowest modes of the graph-Laplacian and is related to the map coloring problem.

50 citations
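
To make the error-graph mentioned in the abstract above concrete, here is a minimal sketch that builds one plausible version of it. It assumes that two codons are "likely to be confused" exactly when they differ at a single nucleotide position; that rule and the variable names are illustrative assumptions, not the authors' stated construction.

```python
from itertools import product

BASES = "ACGU"
codons = ["".join(p) for p in product(BASES, repeat=3)]  # all 64 triplets


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Assumed confusion rule: an edge joins codons that differ at exactly one position.
edges = [
    (a, b)
    for i, a in enumerate(codons)
    for b in codons[i + 1:]
    if hamming(a, b) == 1
]

# Degree of each codon in the assumed error-graph.
degree = {c: 0 for c in codons}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
```

Under this assumption every codon has nine neighbours (three positions times three alternative bases), which gives a regular graph on 64 vertices.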


Journal Article•DOI•
TL;DR: The microbial diversity at Pitch Lake was found to be unique when compared to microbial communities analyzed at other hydrocarbon-rich environments, which included Rancho La Brea, a natural asphalt environment in California, USA, and an oil well and a mud volcano in Trinidad and Tobago, among other sites.
Abstract: An active microbiota, reaching up to 10^7 cells/g, was found to inhabit a naturally occurring asphalt lake characterized by low water activity and elevated temperature. Geochemical and molecular taxonomic approaches revealed novel and deeply branching microbial assemblages mediating anaerobic hydrocarbon degradation, metal respiration and C1 utilization pathways. These results open a window into the origin and adaptive evolution of microbial life within recalcitrant hydrocarbon matrices, and establish the site as a useful analog for the liquid hydrocarbon environments on Saturn's moon Titan.

47 citations


Journal Article•DOI•
TL;DR: Two-dimensional gel electrophoresis has been instrumental in the birth and development of proteomics, although it is no longer the exclusive separation tool used in the field.
Abstract: Two-dimensional gel electrophoresis has been instrumental in the birth and development of proteomics, although it is no longer the exclusive separation tool used in the field. In this review, a historical perspective is given, starting from the days when two-dimensional gels were used and the word proteomics did not even exist. The events that have led to the birth of proteomics are also recalled, ending with a description of the now well-known limitations of two-dimensional gels in proteomics. However, the often-underestimated advantages of two-dimensional gels are also underlined, leading to a description of how and when to use two-dimensional gels to best effect in a proteomics approach. Building on these advantages (robustness, resolution, and ability to separate entire, intact proteins), possible future applications of this technique in proteomics are also mentioned.

19 citations


Posted Content•
TL;DR: It is concluded that no single computer program is necessarily capable of finding all of the tRNA genes in any given mitogenome, and that use of both the ARWEN and DOGMA programs is sometimes necessary because they produce complementary false negatives.
Abstract: The ability to locate and annotate mitochondrial genes is an important practical issue, given the rapidly increasing number of mitogenomes appearing in the public databases. Unfortunately, tRNA genes in Metazoan mitochondria have proved to be problematic because they often vary in number (genes missing or duplicated) and also in the secondary structure of the transcribed tRNAs (T or D arms missing). I have performed a series of comparative analyses of the tRNA genes of a broad range of Metazoan mitogenomes in order to address this issue. I conclude that no single computer program is necessarily capable of finding all of the tRNA genes in any given mitogenome, and that use of both the ARWEN and DOGMA programs is sometimes necessary because they produce complementary false negatives. There are apparently a very large number of erroneous annotations in the databased mitogenome sequences, including missed genes, wrongly annotated locations, false complements, and inconsistent criteria for assigning the 5' and 3' boundaries; and I have listed many of these. The extent of overlap between genes is often greatly exaggerated due to inconsistent annotations, although notable overlaps involving tRNAs are apparently real. Finally, three novel hypotheses were examined and found to have support from the comparative analyses: (1) some organisms have mitogenomic locations that simultaneously code for multiple tRNAs; (2) some organisms have mitogenomic locations that simultaneously code for tRNAs and proteins (but not rRNAs); and (3) one group of nematodes has several genes that code for tRNAs lacking both the D and T arms.

13 citations


Posted Content•
TL;DR: In this article, the authors consider the effect of the length distribution of fragments and introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage.
Abstract: Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.

11 citations
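
For readers unfamiliar with the Lander-Waterman setting that the abstract above extends, the following sketch compares the classic per-site prediction (a base is left uncovered with probability about e^(-c), where c is the number of fragments times the mean fragment length divided by the genome length) with a small simulation that draws fragment lengths from a distribution. The circular genome, the uniform length distribution, and all function names are assumptions chosen for illustration; the paper's tree-based "shape" statistics are not reproduced here.

```python
import math
import random


def expected_uncovered_fraction(n_fragments, mean_length, genome_length):
    """Classic per-site prediction: exp(-c) with c = N * E[L] / G."""
    c = n_fragments * mean_length / genome_length
    return math.exp(-c)


def simulate_uncovered_fraction(n_fragments, genome_length, draw_length, seed=0):
    """Place fragments uniformly on a circular genome and measure the uncovered fraction.

    draw_length(rng) must return a fragment length shorter than genome_length.
    """
    rng = random.Random(seed)
    diff = [0] * (genome_length + 1)  # difference array of coverage changes
    for _ in range(n_fragments):
        start = rng.randrange(genome_length)
        end = start + draw_length(rng)
        diff[start] += 1
        if end <= genome_length:
            diff[end] -= 1
        else:  # fragment wraps around the end of the circular genome
            diff[genome_length] -= 1
            diff[0] += 1
            diff[end - genome_length] -= 1
    uncovered = 0
    depth = 0
    for pos in range(genome_length):
        depth += diff[pos]
        if depth == 0:
            uncovered += 1
    return uncovered / genome_length


# Hypothetical experiment: lengths uniform on 50..150 (mean 100), coverage c = 2.
predicted = expected_uncovered_fraction(4000, 100, 200_000)
observed = simulate_uncovered_fraction(4000, 200_000, lambda rng: rng.randint(50, 150))
```

The prediction depends on the length distribution only through its mean, which is consistent with the abstract's remark that site-level coverage statistics miss the structure captured by looking at whole fragments.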


Journal Article•DOI•
TL;DR: In this paper, the authors used microarrays to identify 23 microRNAs that are regulated by phorbol myristate acetate (PMA), and they identified four PMA-induced microRNAs (mir-155, mir-222, mir-424 and mir-503) that when overexpressed cause cell-cycle arrest and partial differentiation and when used in combination induce additional changes not seen by any individual microRNA.
Abstract: Acute myeloid leukemia (AML) involves a block in terminal differentiation of the myeloid lineage and uncontrolled proliferation of a progenitor state. Using phorbol myristate acetate (PMA), it is possible to overcome this block in THP-1 cells (an M5-AML containing the MLL-MLLT3 fusion), resulting in differentiation to an adherent monocytic phenotype. As part of FANTOM4, we used microarrays to identify 23 microRNAs that are regulated by PMA. We identify four PMA-induced microRNAs (mir-155, mir-222, mir-424 and mir-503) that when overexpressed cause cell-cycle arrest and partial differentiation and when used in combination induce additional changes not seen by any individual microRNA. We further characterize these pro-differentiative microRNAs and show that mir-155 and mir-222 induce G2 arrest and apoptosis, respectively. We find mir-424 and mir-503 are derived from a polycistronic precursor mir-424-503 that is under repression by the MLL-MLLT3 leukemogenic fusion. Both of these microRNAs directly target cell-cycle regulators and induce G1 cell-cycle arrest when overexpressed in THP-1. We also find that the pro-differentiative mir-424 and mir-503 downregulate the anti-differentiative mir-9 by targeting a site in its primary transcript. Our study highlights the combinatorial effects of multiple microRNAs within cellular systems.

10 citations


Posted Content•
TL;DR: The authors used Markov Logic Networks (MLNs) as a framework for combining deterministic knowledge with statistical analysis, which can be used as a general framework for incorporating systems biology with genetics.
Abstract: Complex, non-additive genetic interactions are common and can be critical in determining phenotypes. Genome-wide association studies (GWAS) and similar statistical studies of linkage data, however, assume additive models of gene interactions in looking for genotype-phenotype associations. These statistical methods view the compound effects of multiple genes on a phenotype as a sum of partial influences of each individual gene and can often miss a substantial part of the heritable effect. Such methods do not use any biological knowledge about underlying genotype-phenotype mechanisms. Modeling approaches from the AI field that incorporate deterministic knowledge into models to perform statistical analysis can be applied to include prior knowledge in genetic analysis. We chose to use the most general such approach, Markov Logic Networks (MLNs), as a framework for combining deterministic knowledge with statistical analysis. Using simple, logistic regression-type MLNs we have been able to replicate the results of traditional statistical methods. Moreover, we show that even with simple models we are able to go beyond finding independent markers linked to a phenotype by using joint inference that avoids an independence assumption. The method is applied to genetic data on yeast sporulation, a phenotype governed by non-linear gene interactions. In addition to detecting all of the previously identified loci associated with sporulation, our method is able to identify four loci with small effects. Since their effect on sporulation is small, these four loci were not detected with methods that do not account for dependence between markers due to gene interactions. We show how gene interactions can be detected using more complex models, which can be used as a general framework for incorporating systems biology with genetics.

9 citations


Posted Content•
TL;DR: In this article, the authors study the arrival time of several TFs to multiple binding sites and derive, in the presence of competitive binding ligands, the probability that several target sites are bound.
Abstract: Transcription factors (TFs) are key regulators of gene expression. Based on the classical scenario in which the TF search process switches between one-dimensional motion along the DNA molecule and free Brownian motion in the nucleus, we study the arrival time of several TFs to multiple binding sites and derive, in the presence of competitive binding ligands, the probability that several target sites are bound. We then apply our results to the hunchback regulation by bicoid in the fly embryo and we propose a general mechanism that allows cells to read a morphogenetic gradient and specialize according to their position in the embryo.

7 citations


Posted Content•
TL;DR: In this paper, a SAT-based algorithm is used to determine the gene predictor set from steady-state gene expression data (attractor states). Using the attractor states as input, the states are ordered into attractor cycles.
Abstract: The inference of gene predictors in the gene regulatory network (GRN) has become an important research area in the genomics and medical disciplines. Accurate predictors are necessary for constructing the GRN model and to enable targeted biological experiments that attempt to confirm or control the regulation process. In this paper, we implement a SAT-based algorithm to determine the gene predictor set from steady-state gene expression data (attractor states). Using the attractor states as input, the states are ordered into attractor cycles. For each attractor cycle ordering, all possible predictors are enumerated and a CNF expression is formulated which encodes these predictors and their biological constraints. Each CNF is explored using a SAT solver to find candidate predictor sets. Statistical analysis of the results selects the most likely predictor set of the GRN corresponding to the attractor data. We demonstrate our algorithm on attractor state data from a melanoma study, and present our predictor set results.

Journal Article•DOI•
Noa Sela, Eddo Kim, Gil Ast•
TL;DR: In this paper, the authors analyzed the influence of transposable elements (TEs) on the transcriptomes of three invertebrates and two non-mammalian vertebrates, and compared it with mammals in terms of mechanisms such as exonization and intronization (the birth of new exons/introns from previously intronic/exonic sequences, respectively) and insertion into first and last exons.
Abstract: Background: Transposable elements (TEs) have played an important role in the diversification and enrichment of mammalian transcriptomes through various mechanisms such as exonization and intronization (the birth of new exons/introns from previously intronic/exonic sequences, respectively), and insertion into first and last exons. However, no extensive analysis has compared the effects of TEs on the transcriptomes of mammals, non-mammalian vertebrates and invertebrates. Results: We analyzed the influence of TEs on the transcriptomes of five species, three invertebrates and two non-mammalian vertebrates. Compared to previously analyzed mammals, there were lower levels of TE introduction into introns, significantly lower numbers of exonizations originating from TEs and a lower percentage of TE insertion within the first and last exons. Although the transcriptomes of vertebrates exhibit a significant level of exonizations of TEs, only anecdotal cases were found in invertebrates. In vertebrates, as in mammals, the exonized TEs are mostly alternatively spliced, indicating selective pressure maintains the original mRNA product generated from such genes. Conclusions: Exonization of TEs is widespread in mammals, less so in non-mammalian vertebrates, and very low in invertebrates. We assume that the exonization process depends on the length of introns. Vertebrates, unlike invertebrates, are characterized by long introns and short internal exons. Our results suggest that there is a direct link between the length of introns and exonization of TEs and that this process became more prevalent following the appearance of mammals.

Posted Content•
TL;DR: CGHTRIMMER provides a new alternative for the problem of aCGH discretization that provides superior detection of fine-scale regions of gain or loss yet is fast enough to process very large data sets in seconds, meeting an important need for methods capable of handling the vast amounts of data being accumulated in high-throughput studies of tumor genetics.
Abstract: The development of cancer is largely driven by the gain or loss of subsets of the genome, promoting uncontrolled growth or disabling defenses against it. Identifying genomic regions whose DNA copy number deviates from the normal is therefore central to understanding cancer evolution. Array-based comparative genomic hybridization (aCGH) is a high-throughput technique for identifying DNA gain or loss by quantifying total amounts of DNA matching defined probes relative to healthy diploid control samples. Due to the high level of noise in microarray data, however, interpretation of aCGH output is a difficult and error-prone task. In this work, we tackle the computational task of inferring the DNA copy number per genomic position from noisy aCGH data. We propose CGHTRIMMER, a novel segmentation method that uses a fast dynamic programming algorithm to solve for a least-squares objective function for copy number assignment. CGHTRIMMER consistently achieves superior precision and recall to leading competitors on benchmarks of synthetic data and real data from the Coriell cell lines. In addition, it finds several novel markers not recorded in the benchmarks but plausibly supported in the oncology literature. Furthermore, CGHTRIMMER achieves superior results with run-times from 1 to 3 orders of magnitude faster than its state-of-art competitors. CGHTRIMMER provides a new alternative for the problem of aCGH discretization that provides superior detection of fine-scale regions of gain or loss yet is fast enough to process very large data sets in seconds. It thus meets an important need for methods capable of handling the vast amounts of data being accumulated in high-throughput studies of tumor genetics.
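
The CGHTRIMMER abstract above describes a dynamic program over a least-squares objective. As a hedged sketch of that general family of methods, the code below segments a noisy signal by minimizing the squared error around each segment's mean plus a constant penalty per segment, using a simple quadratic-time recurrence. The penalty form, the function name `segment_least_squares`, and the toy signal are assumptions; this is not the paper's objective in detail, nor its fast algorithm.

```python
def segment_least_squares(signal, penalty):
    """Split signal into piecewise-constant segments.

    Minimizes (sum of squared deviations from each segment mean) + penalty * (#segments)
    with the recurrence best[i] = min over j < i of best[j] + cost(j, i) + penalty.
    Returns a list of (start, end, mean) with end exclusive.
    """
    n = len(signal)
    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(signal):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(j, i):
        total = s1[i] - s1[j]
        return (s2[i] - s2[j]) - total * total / (i - j)

    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(j, i) + penalty
            if candidate < best[i]:
                best[i] = candidate
                prev[i] = j

    # Walk the back-pointers to recover segment boundaries.
    bounds = []
    i = n
    while i > 0:
        bounds.append((prev[i], i))
        i = prev[i]
    bounds.reverse()
    return [(j, i, (s1[i] - s1[j]) / (i - j)) for j, i in bounds]


# Hypothetical log-ratio profile: baseline, a gained region, baseline again.
toy_signal = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.05, -0.05]
toy_segments = segment_least_squares(toy_signal, penalty=0.2)
```

The penalty controls how readily the method opens a new segment: larger values favour fewer, longer regions, while smaller values pick up finer-scale changes at the risk of fitting noise.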

Posted Content•
TL;DR: It is demonstrated that the transition from fluctuation to order takes place at a sequence length of about 200-300 thousand bases for the human and E. coli genomes, and that the sum rule Q(k,N) increases with length N at a constant rate for most genome sequences and is correlated with the evolutionary complexity of the genome.
Abstract: Sequence organization is viewed from two perspectives: informational redundancy, or informational correlation (IC), and k-mer frequency statistics. Two problems are investigated. The first is how the ICs exceed the fluctuation bound and order emerges from fluctuation in a genome when the sequence length attains some critical value. We demonstrated that the transition from fluctuation to order takes place at a sequence length of about 200-300 thousand bases for the human and E. coli genomes. This means that life emerges from a region between the macroscopic and the microscopic. The second is the statistical law of k-mer organization in a genome under evolutionary pressure and functional selection. We deduced a sum rule Q(k,N) on the k-mer frequency deviations from randomness in an N-long genome sequence and deduced the relations of Q(k,N) with k and N. We found that Q(k,N) increases with length N at a constant rate for most genome sequences and demonstrated that ordering takes place when the functional selection of k-mers accumulates to some critical value. An important finding is that the sum rule is correlated with the evolutionary complexity of the genome.

Posted Content•
TL;DR: The combined influence of amino acid composition and chain length on the thermal stability of protein structures is studied and a new parameterization of the internal free energy is considered, as the sum of hydrophobic effect, hydrogen-bond and de-hydration energy terms.
Abstract: We study the combined influence of amino acid composition and chain length on the thermal stability of protein structures. A new parameterization of the internal free energy is considered, as the sum of hydrophobic effect, hydrogen-bond and de-hydration energy terms. We divided a non-redundant selection of protein structures from the Protein Data Bank into three groups: i) rich in order-promoting residues (OPR proteins); ii) rich in disorder-promoting residues (DPR proteins); iii) belonging to a twilight zone (TZ proteins). We observe a partition of the PDB into several groups with different internal free energies, amino acid compositions and protein lengths. The internal free energy of 96% of the proteins analyzed ranges from -2 to -6.5 kJ/mol/res. We found many DPR and OPR proteins with the same relative thermal stability. Only OPR proteins with internal energy between -4 and -6.5 kJ/mol/res are observed to have chains longer than 200 residues, with a high de-hydration energy compensated by the hydrophobic effect. DPR and TZ proteins are shorter than 200 residues and they have an internal energy above -4 kJ/mol/res, with a few exceptions among TZ proteins. Hydrogen bonds play an important role in the stabilization of these DPR folds, often higher than contact energy. The new parameterization of internal free energy reveals a geography of thermal stabilities of PDB structures. Amino acid composition per se is not sufficient to determine the stability of protein folds, since DPR and TZ proteins generally have a relatively high internal free energy, and they are stabilized by hydrogen bonds. Long DPR proteins are not observed in the PDB, because their low hydrophobicity cannot compensate for the high de-hydration energy necessary to accommodate residues within a highly packed globular fold.

Posted Content•
TL;DR: The observations support the idea that the spread of the unfoldome has been often overestimated and the structure-function paradigm is generally valid and pre-existing stereo-chemical complementarity among structures remains an important requisite for interactions between biological macromolecules.
Abstract: The term unfoldome has been recently used to indicate the universe of intrinsically disordered proteins. These proteins are characterized by an ensemble of highly flexible interchangeable conformations and therefore they can interact with many targets without requiring pre-existing stereo-chemical complementarity. It has been suggested that intrinsically disordered proteins are frequent in proteomes and disorder is widespread also in structured proteins. However, several studies raise some doubt about these views. In this paper we estimate the frequency of intrinsically disordered proteins in several living organisms by using the ratio S between the likelihood, for a protein sequence, of being composed mainly of order-promoting or disorder-promoting residues. We scan several proteomes from Archaea, Bacteria and Eukarya. We find the following figures: 1.63% for Archaea, 3.91% for Bacteria, 16.35% for Eukarya. The frequencies we found can be considered an upper bound to the real frequency of intrinsically disordered proteins in proteomes. Our estimates are lower than those previously reported in several studies. A scanning of proteins in the Protein Data Bank (PDB) searching for segments of non-observed residues reveals that segments of non-observed residues longer than 30 amino acids are rare. Our observations support the idea that the spread of the unfoldome has been often overestimated. If we exclude some exceptions, the structure-function paradigm is generally valid and pre-existing stereo-chemical complementarity among structures remains an important requisite for interactions between biological macromolecules.

Posted Content•
TL;DR: PIntron as discussed by the authors uses maximal embeddings that are sequences obtained from paths of a graph structure, called Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and of the output.
Abstract: Current computational methods for exon-intron structure prediction from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency necessary to process large clusters of more than 20,000 ESTs and genes longer than 1Mb. Guaranteeing both accuracy and efficiency seems to be a computational goal still far from being achieved, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions that are highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings that are sequences obtained from paths of a graph structure, called Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and of the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and PIntron software are available at this http URL

Posted Content•
TL;DR: In this paper, the authors used entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods.
Abstract: Motivation: Second generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide scale biological variation, but within human bio-medicine, it offers a direct way of observing both large scale structural variation and fine scale sequence variation. Unfortunately, improvements in the computational feasibility for de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem, and the in-practice memory requirements of tools. Results: In this paper we use entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods. In particular we show that when stored succinctly, the de Bruijn assembly graph for Homo sapiens requires only 23 gigabytes of storage. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better scaling behaviour asymptotically than conventional approaches.
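
For context on what is being compressed in the abstract above, the sketch below builds a de Bruijn graph in the plainest possible way: nodes are (k-1)-mers, and every k-mer found in a read contributes an edge from its prefix to its suffix. This dictionary-of-sets layout is the kind of memory-hungry conventional structure the abstract contrasts with, not the succinct representation the paper proposes; the function name and the toy reads are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Hypothetical overlapping reads; the shared k-mer GTAC is stored only once.
toy_graph = de_bruijn_graph(["ACGTAC", "GTACGG"], k=4)
edge_count = sum(len(successors) for successors in toy_graph.values())
```

Each edge corresponds to one distinct k-mer, so the storage cost grows with the number of distinct k-mers, including those created by sequencing errors.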

Posted Content•
TL;DR: A new method is proposed to build a large-scale DNA sequence search system based on web search engine technology that is able to provide millisecond-level search services for billions of DNA sequences on a typical server.
Abstract: This paper proposes a new method to build a large-scale DNA sequence search system based on web search engine technology. We first give a very brief introduction to the methods used in search engines. Then how to build a DNA search system like Google is illustrated in detail. Since there is no local alignment process, this system is able to provide millisecond-level search services for billions of DNA sequences on a typical server.

Posted Content•
TL;DR: These results reveal and characterize a new type of local chromatin structure in yeast that is strongly nucleosome-depleted and preferentially targeted by chromatin-remodeling complexes and the origin-of-replication complex (ORC).
Abstract: Transcription factors (TF) play an essential role in the cell as locus- and condition-specific recruiters of transcriptional machinery or chromatin-modifying complexes. However, predicting the in vivo profile of TF occupancy along the genome, which depends on complex interactions with other chromatin-associated proteins, from the DNA sequence remains a major challenge. Through careful reanalysis of ChIP-chip data for 138 TFs obtained in rich media, we were able to classify the upstream promoter regions of S. cerevisiae into 15 distinct chromatin types. One of these encompasses 5% of all promoters and is unique in that it is highly occupied by (essentially) all TFs expressed in rich media. These "hotspots" of TF occupancy are strongly nucleosome-depleted and preferentially targeted by chromatin-remodeling complexes and the origin-of-replication complex (ORC). They are also the only chromatin type enriched for predicted Rap1p and Pdr1p binding sites, which we found to work cooperatively with AAA/TTT motifs, known to affect local DNA structure, to reduce nucleosome occupancy. Taken together, our results reveal and characterize a new type of local chromatin structure in yeast.

Posted Content•
TL;DR: A novel approach for reconstruction of the composition of an unknown mixture of bacteria using a single Sanger-sequencing reaction of the mixture based on compressive sensing theory, which may have a potential for a practical and efficient way for identifying bacterial species compositions in biological samples.
Abstract: Bacteria are the unseen majority on our planet, with millions of species and comprising most of the living protoplasm. While current methods enable in-depth study of a small number of communities, a simple tool for breadth studies of bacterial population composition in a large number of samples is lacking. We propose a novel approach for reconstruction of the composition of an unknown mixture of bacteria using a single Sanger-sequencing reaction of the mixture. This method is based on compressive sensing theory, which deals with reconstruction of a sparse signal using a small number of measurements. Utilizing the fact that in many cases each bacterial community is comprised of a small subset of the known bacterial species, we show the feasibility of this approach for determining the composition of a bacterial mixture. Using simulations, we show that sequencing a few hundred base-pairs of the 16S rRNA gene sequence may provide enough information for reconstruction of mixtures containing tens of species, out of tens of thousands, even in the presence of realistic measurement noise. Finally, we show initial promising results when applying our method for the reconstruction of a toy experimental mixture with five species. Our approach may have a potential for a practical and efficient way for identifying bacterial species compositions in biological samples.

Posted Content•
TL;DR: This comment on the BAK1 gene variants discussion argues that the response from Dr. Gottlieb and his co-authors to Dr. Hatchwell is unsatisfactory.
Abstract: Dr. Hatchwell [2010] has proposed that the BAK1 gene variants were likely due to sequencing of a processed gene on chromosome 20. However, in response, Dr. Gottlieb and co-authors [2010] have argued that "some but not all of the sequence changes present in the BAK1 sequence of our abdominal aorta samples are also present in the chromosome 20 BAK1 sequence. However, all the AAA and AA cDNA samples are identical to each other and different from chromosome 20 BAK1 sequence at amino acids 2 and 145". I have been following this discussion because I have independently reached almost the same conclusion as Dr. Hatchwell did [Yamagishi, 2009], and, unfortunately, the response from Dr. Gottlieb and his co-authors seems to me to be unsatisfactory for the reasons listed below

Journal Article•DOI•
TL;DR: A complete human proteome project (HPP) appears feasible for the first time, but there is still debate as to how it should be designed and what it should encompass; the debate revolves around whether a gene-centric or a protein-centric proteomics approach is the most appropriate way forward.
Abstract: With the recent developments in proteomic technologies, a complete human proteome project (HPP) appears feasible for the first time. However, there is still debate as to how it should be designed and what it should encompass. In "proteomics speak", the debate revolves around the central question as to whether a gene-centric or a protein-centric proteomics approach is the most appropriate way forward. In this paper, we try to shed light on what these definitions mean, how large-scale proteomics such as a HPP can insert into the larger omics chorus, and what we can reasonably expect from a HPP in the way it has been proposed so far.

Posted Content•
TL;DR: The results support the hypothesis that backbone interactions play a fundamental role in the stabilization of protein structures; however, the role of long-range interactions and its relation with protein length must be further investigated.
Abstract: Amino acid composition is an important determinant of protein structures. In this paper we investigate the relationship between amino acid composition and mechanical stability of protein sequences. We divide the protein structures deposited in the Protein Data Bank (PDB) into ordered, disordered and twilight-zone groups, depending on their amino acid composition. We use a consensus score SSU among three predictors of global disorder, Poodle-W, gVSL2 and mean pairwise energy. Mechanical stability is evaluated through the Miyazawa-Jernigan potential. We find that the three groups of protein sequences have different contact energy, disordered sequences being the most unstable and ordered ones being the most stable. Secondary structure energy and global mechanical stability, on the other hand, are about the same in the three groups of proteins, pointing to a fundamental role of backbone interactions in the stabilization of the tertiary structure. Proteins with relatively high contact energy tend to remain short in length and are not enriched in disorder-promoting amino acids. Moreover, several short proteins in the twilight zone compensate for their relative instability through disulfide bridges. Our results support the hypothesis that backbone interactions play a fundamental role in the stabilization of protein structures. However, the role of long-range interactions and its relation with protein length must be further investigated. It is necessary to develop a more fundamental theory to understand the exact relation between amino acid composition and the mechanical stability of protein sequences.

Journal Article•DOI•
TL;DR: In this paper, a review of the most important variations of SDS electrophoresis is presented, so that the readers can be aware of how they can improve or tune protein separations according to their needs.
Abstract: Electrophoretic separations of proteins are widely used in proteomic analyses, and rely heavily on SDS electrophoresis. This mode of separation is almost exclusively used when a single dimension separation is performed, and generally represents the second dimension of two-dimensional separations. Electrophoretic separations for proteomics use robust, well-established protocols. However, many variations in almost all possible parameters have been described in the literature over the years, and they may bring a decisive advantage when the limits of the classical protocols are reached. The purpose of this article is to review the most important of these variations, so that the readers can be aware of how they can improve or tune protein separations according to their needs. The chemical variations reviewed in this paper encompass gel structure, buffer systems and detergents for SDS electrophoresis, two-dimensional electrophoresis based on isoelectric focusing and two-dimensional electrophoresis based on cationic zone electrophoresis.

Posted Content•
Amin Zia, Alan M. Moses•
TL;DR: In this article, the theoretical dependence of false positives on dataset size was derived and it was shown that false positives can arise as a result of large dataset size, irrespective of the algorithm used, and the false positive strength depends more on the number of sequences in the dataset than it does on the sequence length.
Abstract: Detection of false-positive motifs is one of the main causes of low performance in motif finding methods. It is generally assumed that false-positives are mostly due to algorithmic weakness of motif-finders. Here, however, we derive the theoretical dependence of false positives on dataset size and find that false positives can arise as a result of large dataset size, irrespective of the algorithm used. Interestingly, the false-positive strength depends more on the number of sequences in the dataset than it does on the sequence length. As expected, false-positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. The dependence on number of sequences, however, diminishes and reaches a plateau after which adding more sequences to the dataset does not reduce the false-positive rate significantly. Based on the theoretical results presented here, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice.
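
The direction of the two dependencies named in this abstract (sequence length and number of sequences) can be seen in a toy null model. This is only a sketch and not the derivation in the paper: it assumes uniform, independent bases, treats overlapping windows as independent, and counts a k-mer as a spurious "motif" only if it occurs by chance in every sequence of the dataset. The function names are hypothetical, and the plateau described in the abstract is not modelled.

```python
def prob_kmer_in_sequence(seq_length, k):
    """Approximate chance that one fixed k-mer occurs at least once in a random sequence."""
    windows = max(seq_length - k + 1, 0)
    return 1.0 - (1.0 - 0.25 ** k) ** windows

def expected_spurious_kmers(num_sequences, seq_length, k):
    """Expected number of k-mers present by chance in all sequences of the dataset."""
    return 4 ** k * prob_kmer_in_sequence(seq_length, k) ** num_sequences

# Compare a baseline dataset with a shorter-sequence and a larger-dataset variant.
baseline = expected_spurious_kmers(num_sequences=10, seq_length=500, k=6)
shorter = expected_spurious_kmers(num_sequences=10, seq_length=200, k=6)
more_seqs = expected_spurious_kmers(num_sequences=20, seq_length=500, k=6)
```

Under these assumptions both `shorter` and `more_seqs` come out smaller than `baseline`, which matches the abstract's statement that false positives can be reduced by decreasing the sequence length or by adding more sequences.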

Book Chapter•DOI•
TL;DR: It is proved that for α ∈ (1,2], a minimum-weight transformation may entirely consist of transpositions, implying that the corresponding weighted genomic distance does not actually achieve its purpose of bounding the proportion of transpositions.
Abstract: Genomic distance between two genomes, i.e., the smallest number of genome rearrangements required to transform one genome into the other, is often used as a measure of evolutionary closeness of the genomes in comparative genomics studies. However, in models that include rearrangements of significantly different "power" such as reversals (that are "weak" and most frequent rearrangements) and transpositions (that are more "powerful" but rare), the genomic distance typically corresponds to a transformation with a large proportion of transpositions, which is not biologically adequate. Weighted genomic distance is a traditional approach to bounding the proportion of transpositions by assigning them a relative weight α > 1. A number of previous studies addressed the problem of computing weighted genomic distance with α ≤ 2. Employing the model of multi-break rearrangements on circular genomes, that captures both reversals (modelled as 2-breaks) and transpositions (modelled as 3-breaks), we prove that for α ∈ (1,2], a minimum-weight transformation may entirely consist of transpositions, implying that the corresponding weighted genomic distance does not actually achieve its purpose of bounding the proportion of transpositions. We further prove that for α ∈ (1,2), the minimum-weight transformations do not depend on a particular choice of α from this interval. We give a complete characterization of such transformations and show that they coincide with the transformations that at the same time have the shortest length and make the smallest number of breakages in the genomes. Our results also provide a theoretical foundation for the empirical observation that for α < 2, transpositions are favored over reversals in the minimum-weight transformations.
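
A small sketch can make the role of the weight α concrete. It assumes, as an illustration rather than as the paper's construction, that a genome is a set of adjacency edges, that a k-break replaces k edges with k new edges on the same endpoints, and that a transformation with r 2-breaks and t 3-breaks has weight r + αt. The helper names (`apply_break`, `weight`) and the vertex labels are hypothetical. The example shows one 3-break reproduced by two consecutive 2-breaks, so for α < 2 the single 3-break is the cheaper option.

```python
def apply_break(edges, removed, added):
    """Replace the adjacency edges in `removed` with those in `added` (same endpoints)."""
    edges = {frozenset(e) for e in edges}
    removed = {frozenset(e) for e in removed}
    added = {frozenset(e) for e in added}
    if not removed <= edges:
        raise ValueError("can only break edges that are present")
    if set().union(*removed) != set().union(*added):
        raise ValueError("a break must reuse exactly the same endpoints")
    return (edges - removed) | added

def weight(num_two_breaks, num_three_breaks, alpha):
    """Weighted length: 2-breaks cost 1, 3-breaks cost alpha."""
    return num_two_breaks + alpha * num_three_breaks

start = [("a", "b"), ("c", "d"), ("e", "f")]

# One 3-break (transposition-like): three edges replaced at once.
via_three_break = apply_break(
    start,
    removed=[("a", "b"), ("c", "d"), ("e", "f")],
    added=[("a", "d"), ("c", "f"), ("e", "b")],
)

# The same result as two consecutive 2-breaks (reversal-like operations).
step = apply_break(start, [("a", "b"), ("c", "d")], [("a", "d"), ("c", "b")])
via_two_breaks = apply_break(step, [("c", "b"), ("e", "f")], [("c", "f"), ("e", "b")])

alpha = 1.5  # any value in (1, 2) gives the same ordering of the two costs
cost_three_break = weight(0, 1, alpha)
cost_two_breaks = weight(2, 0, alpha)
```

Both routes end in the same edge set, with `cost_three_break` equal to α and `cost_two_breaks` equal to 2. This matches the abstract's observation that for α < 2 transpositions are favored over reversals; the sketch does not check that the intermediate genome is biologically valid.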

Book Chapter•DOI•
Abstract: An important question in genome evolution is whether there exist fragile regions (rearrangement hotspots) where chromosomal rearrangements are happening over and over again. Although nearly all recent studies supported the existence of fragile regions in mammalian genomes, the most comprehensive phylogenomic study of mammals (Ma et al. (2006) Genome Research 16, 1557-1565) raised some doubts about their existence. We demonstrate that fragile regions are subject to a "birth and death" process, implying that fragility has limited evolutionary lifespan. This finding implies that fragile regions migrate to different locations in different mammals, explaining why there exist only a few chromosomal breakpoints shared between different lineages. The birth and death of fragile regions phenomenon reinforces the hypothesis that rearrangements are promoted by matching segmental duplications and suggests putative locations of the currently active fragile regions in the human genome.

Journal Article•DOI•
TL;DR: In this article, the authors derived asymptotic error rates of prediction procedures based on the weight matrix (WM) and Gibbs free energy (FE) models under different data generation assumptions and demonstrated that the FE approach shows higher or comparable predictive power relative to the WM approach when the number of observed binding sites used for constructing a discriminant decision is not too small.
Abstract: The problem of motif detection can be formulated as the construction of a discriminant function to separate sequences of a specific pattern from background. In computational biology, motif detection is used to predict DNA binding sites of a transcription factor (TF), mostly based on the weight matrix (WM) model or the Gibbs free energy (FE) model. However, despite the wide applications, theoretical analysis of these two models and their predictions is still lacking. We derive asymptotic error rates of prediction procedures based on these models under different data generation assumptions. This allows a theoretical comparison between the WM-based and the FE-based predictions in terms of asymptotic efficiency. Applications of the theoretical results are demonstrated with empirical studies on ChIP-seq data and protein binding microarray data. We find that, irrespective of underlying data generation mechanisms, the FE approach shows higher or comparable predictive power relative to the WM approach when the number of observed binding sites used for constructing a discriminant decision is not too small.
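
As a concrete instance of the discriminant-function view described in this abstract, the following sketch builds a log-odds weight matrix from a handful of binding sites and scores sequence windows with it. It covers only the WM model, not the FE model, and it is not the estimator analyzed in the paper: the example sites, the pseudocount of 1, and the uniform background are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Per-position log2 odds of each base against a uniform background."""
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix

def score(matrix, window):
    """Discriminant score of one window: the sum of per-position log-odds."""
    return sum(column[ch] for column, ch in zip(matrix, window))

def best_hit(matrix, sequence):
    """Return (score, start index) of the highest-scoring window in a sequence."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned binding sites (illustration only, not data from the paper).
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
WM = build_weight_matrix(SITES)

top_score, top_start = best_hit(WM, "GGTGACTCAGG")
```

A window is then predicted to be a binding site when its score exceeds a chosen threshold. How that threshold and the matrix estimates behave as the number of observed sites grows is the kind of question the asymptotic analysis in the abstract addresses.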

Posted Content•
TL;DR: Eleven protein targets common to the four stages of the parasite life cycle are suggested as possible vaccines against Plasmodium falciparum.
Abstract: In this paper, we suggest eleven protein targets to be used as possible vaccines against Plasmodium falciparum, the causative agent of almost two to three million deaths per year. The protein targets have been selected through a comprehensive analysis of the small experimental fragment of antigen in the P. falciparum genome, all of them common to the four stages of the parasite life cycle (i.e., sporozoites, merozoites, trophozoites and gametocytes). The potential vaccine candidates should be analyzed with in silico techniques using various bioinformatics tools. Finally, the possible protein targets according to PlasmoDB gene ID are PFC0975c, PFE0660c, PF08_0071, PF10_0084, PFI0180w, MAL13P1.56, PF14_0192, PF13_0141, PF14_0425, PF13_0322, and PF14_0598.