
Showing papers presented at "International Conference on Bioinformatics in 2007"


Proceedings ArticleDOI
10 Jun 2007
TL;DR: Almost all existing methods are compared for the purpose of identifying protein coding regions, using the discrete Fourier transform (DFT) based spectral content measure to exploit period-3 behaviour in the exonic regions for the GENSCAN test set.
Abstract: Processing of DNA sequences using traditional digital signal processing methods requires their conversion from a character string into numerical sequences as a first step. Many previously introduced representations assign values to the four DNA nucleotides A, C, G, and T that impose mathematical structures not present in the actual DNA sequence. In this paper, almost all existing methods are compared for the purpose of identifying protein coding regions, using the discrete Fourier transform (DFT) based spectral content measure to exploit period-3 behaviour in the exonic regions of the GENSCAN test set. False positive vs. sensitivity curves, receiver operating characteristic (ROC) curves and the exonic nucleotides detected as false positives all show that the two newly proposed numerical DNA representations perform better than the well-known Z-curve, tetrahedron, and Voss representations, with 66-75% less processing. Compared with the Voss representation, the proposed paired numeric method can produce relative improvements of up to 12% in the prediction accuracy of exonic nucleotides at a 10% false positive rate on the GENSCAN test set.

103 citations
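
The period-3 measure used above can be illustrated in a few lines: each base is expanded into a binary indicator sequence (the Voss representation, one of those compared in the paper), and the spectral content at the frequency bin k = N/3 is summed over the four channels. A minimal sketch, assuming a plain A/C/G/T string and a window trimmed to a multiple of 3:

```python
import numpy as np

def period3_content(seq: str) -> float:
    """DFT-based spectral content measure at k = N/3 (Voss binary indicators)."""
    n = len(seq) - len(seq) % 3          # trim so N is divisible by 3
    seq = seq[:n]
    k = n // 3                           # period-3 frequency bin
    total = 0.0
    for base in "ACGT":
        indicator = np.array([1.0 if c == base else 0.0 for c in seq])
        x = np.fft.fft(indicator)
        total += abs(x[k]) ** 2          # spectral content S(N/3)
    return total

# Sliding this measure over a sequence highlights exon-like, period-3 regions.
print(period3_content("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"))
```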


Journal ArticleDOI
01 Jun 2007
TL;DR: An integrated multi-SNP, multi-array genotype calling algorithm for Affymetrix SNP arrays, MAMS, that combines single-array multi-SNP (SAMS) and multi-array single-SNP (MASS) calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods, is developed.
Abstract: Motivation: Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, e.g., use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls. Results: We developed an integrated multi-SNP, multi-array genotype calling algorithm for Affymetrix SNP arrays, MAMS, that combines single-array multi-SNP (SAMS) and multi-array, single-SNP (MASS) calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods. The algorithm uses resampling techniques and model-based clustering to derive single-array-based genotype calls, which are subsequently refined by competitive genotype calls based on MASS clustering. The resampling scheme caps computation for single-array analysis and hence is readily scalable, which is important in view of expanding numbers of SNPs per array. The MASS update is designed to improve calls for atypical SNPs, harboring allele-imbalanced binding affinities, that are difficult to genotype without information from other arrays. Using a publicly available data set of HapMap samples from Affymetrix, and independent calls by alternative genotyping methods from the HapMap project, we show that our approach performs competitively with existing methods. Availability: R functions are available upon request from the authors. Contact: yxiao@itsa.ucsf.edu and rufang@biostat.ucsf.edu Supplementary information: Supplementary data are available at Bioinformatics online.

61 citations
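
The paper's single-array step relies on model-based clustering of probe intensities into three genotype classes. The toy sketch below is not the MAMS algorithm itself, only an illustration of that general idea: summarize each SNP by an allele contrast and cluster the contrasts of many samples into AA/AB/BB groups with a three-component Gaussian mixture (scikit-learn is assumed to be available; the contrast definition and initial means are assumptions for the example).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def call_genotypes(a_intensity: np.ndarray, b_intensity: np.ndarray) -> list[str]:
    """Cluster allele contrasts of one SNP across arrays into AA/AB/BB."""
    contrast = (a_intensity - b_intensity) / (a_intensity + b_intensity)
    gmm = GaussianMixture(n_components=3, means_init=[[-0.7], [0.0], [0.7]],
                          random_state=0)
    labels = gmm.fit_predict(contrast.reshape(-1, 1))
    # Order components by mean so the lowest contrast maps to BB, the highest to AA.
    order = np.argsort(gmm.means_.ravel())
    names = {order[0]: "BB", order[1]: "AB", order[2]: "AA"}
    return [names[lab] for lab in labels]

rng = np.random.default_rng(0)
a = np.concatenate([rng.normal(2000, 200, 30), rng.normal(1000, 150, 30), rng.normal(300, 80, 30)])
b = np.concatenate([rng.normal(300, 80, 30), rng.normal(1000, 150, 30), rng.normal(2000, 200, 30)])
print(call_genotypes(a, b)[:5])
```

In MAMS, this kind of clustering is only one ingredient; the resampling-based single-array calls and the multi-array refinement described in the abstract are not reproduced here.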


Proceedings Article
01 Jan 2007
TL;DR: In this paper, a novel quality control algorithm based on target sequence mapping onto genome and GeneChip expression data analysis was proposed to evaluate the impact of erroneous and poorly annotated target sequences on the quality of gene expression data.
Abstract: Careful analysis of microarray probe design should be an obligatory component of the MicroArray Quality Control (MAQC) project [Patterson et al., 2006; Shi et al., 2006] initiated by the FDA (USA) in order to provide quality control tools to researchers of gene expression profiles and to translate the microarray technology from bench to bedside. The identification and filtering of unreliable probesets are important preprocessing steps before analysis of microarray data. These steps may result in an essential improvement in the selection of differentially expressed genes, gene clustering and construction of co-regulatory expression networks. We revised the genome localization of the Affymetrix U133A&B GeneChip initial (target) probe sequences, and evaluated the impact of erroneous and poorly annotated target sequences on the quality of gene expression data. We found that about 25% of Affymetrix target sequences overlap with interspersed repeats that could cause cross-hybridization effects. In total, discrepancies in target sequence annotation account for up to ∼30% of the 44,692 Affymetrix probesets. We introduce a novel quality control algorithm based on target sequence mapping onto the genome and GeneChip expression data analysis. To validate the quality of probesets we used expression data from large, clinically and genetically distinct groups of breast cancers (249 samples). For the first time, we quantitatively evaluated the effect of repeats and other sources of inadequate probe design on the specificity, reliability and discrimination ability of Affymetrix probesets. We propose that only the functionally reliable Affymetrix probesets that passed our quality control algorithm (∼86%) should be utilized for gene expression analysis. The target sequence annotation and filtering is available upon request.

24 citations


Proceedings ArticleDOI
10 Jun 2007
TL;DR: A robust version of Fisher's test for detecting hidden periodicities in uniformly sampled time series data is presented; it performs better than the original test when the data are not truly Gaussian.
Abstract: Periodicity detection in time series measurements is a common application of signal processing in the study of biological data. The reasons for detecting periodically behaving biological events are many; e.g., periodicity in a gene expression time series could suggest cell-cycle control over that gene's expression. In this paper we present a robust version of Fisher's test for detecting hidden periodicities in uniformly sampled time series data. The robust test performs better than the original test when the data are not truly Gaussian, and the proposed robust method is nearly as fast to evaluate as the original Fisher's test.

18 citations
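
Fisher's test, which the paper robustifies, is based on the g-statistic: the largest periodogram ordinate divided by the sum of all ordinates, with an exact p-value under a Gaussian white-noise null. A minimal sketch of the classical (non-robust) version:

```python
import numpy as np
from math import comb

def fisher_g_test(x: np.ndarray) -> tuple[float, float]:
    """Classical Fisher test for a single hidden periodicity in a time series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    spec = np.abs(np.fft.fft(x)) ** 2
    m = (n - 1) // 2                      # Fourier frequencies, excluding 0 and Nyquist
    p = spec[1:m + 1]
    g = p.max() / p.sum()
    # Exact p-value of the g-statistic under the Gaussian white-noise null.
    pval = sum((-1) ** (k + 1) * comb(m, k) * max(1 - k * g, 0.0) ** (m - 1)
               for k in range(1, int(1 // g) + 1))
    return g, min(pval, 1.0)

t = np.arange(48)
series = np.sin(2 * np.pi * t / 8) + np.random.default_rng(1).normal(0, 0.5, 48)
print(fisher_g_test(series))   # large g, small p-value for a strong period-8 component
```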


Proceedings Article
01 Jan 2007
TL;DR: It has been demonstrated that horizontal transfer can be preserved by selection in the course of evolution even without "selfish genes", which extends the applicability of this tool to various processes of information transduction among populations, provided that these processes resemble horizontal gene transfer.
Abstract: An original modeling tool called Evolutionary Constructor has been proposed and described. Evolutionary Constructor combines the advantages of both generalized and portrait modeling and, additionally, provides an option to modify a current model's structure. The evolution of communities comprising a trophic ring-like network, with horizontal transfer of metabolism genes occurring among the communities, has been modeled and presented. It has been demonstrated that a prolonged increase in the fitness of any single population that forms part of such a trophic ring-like network of antagonistic communities will eventually make the system absolutely dependent on environmental fluctuations. This result challenges the intuitive assumption that the higher a population's fitness, the more stable that population is. Modeling of a system composed of symbiotic communities has revealed that horizontal transfer confers a selective advantage not only on the acceptor population (which is expected) but also on the donor population. It has therefore been demonstrated that horizontal transfer can be preserved by selection in the course of evolution even without "selfish genes". Evolutionary Constructor can handle any phenotypic trait that is controlled genetically, epigenetically, etc., which extends the applicability of this tool to various processes of information transduction among populations, provided that these processes resemble horizontal gene transfer.

16 citations


Proceedings ArticleDOI
10 Jun 2007
TL;DR: This paper proposes a Sigma filter algorithm, which has superior performance for denoising DNA copy number aberrations and low computational complexity, and presents a comparison study between this approach and other smoothing and statistical approaches: the wavelet-based, LookAhead, CGH segmentation and HMM methods.
Abstract: DNA copy number aberrations are characteristic of many genomic diseases including cancer. Microarray-based comparative genomic hybridization (aCGH) is a recently developed high-throughput technique to map and detect DNA copy number (DCN) aberrations. Unfortunately, the observed copy number changes are corrupted by noise, making aberration boundaries hard to detect: detection may yield false positive breakpoints or miss true ones. As a result, many approaches have been proposed to eliminate fluctuations in the DCN data within each aberrant interval while preserving edges across interval boundaries. In this paper, we propose a Sigma filter algorithm, which has superior performance for denoising such data and low computational complexity. We present a comparison study between our approach and other smoothing and statistical approaches: the wavelet-based, LookAhead, CGH segmentation and HMM methods. Finally, we provide examples using real data sets to illustrate the performance of the algorithms.

16 citations
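
The Sigma filter (originally proposed by Lee for image denoising) replaces each value by the average of the neighbouring values that lie within a few noise standard deviations of it, which smooths inside aberrant segments while leaving sharp breakpoints intact. A minimal 1-D sketch; the window size, the two-sigma threshold and the noise estimate are assumed parameters, not the paper's settings:

```python
import numpy as np

def sigma_filter(log2_ratios: np.ndarray, half_window: int = 5,
                 n_sigma: float = 2.0) -> np.ndarray:
    """1-D Sigma filter: average only the neighbours close to the centre value."""
    x = np.asarray(log2_ratios, dtype=float)
    sigma = np.std(np.diff(x)) / np.sqrt(2)      # rough per-probe noise estimate
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        window = x[lo:hi]
        close = window[np.abs(window - x[i]) <= n_sigma * sigma]
        out[i] = close.mean()                    # centre is always included
    return out

# Simulated profile: a single-copy gain between probes 40 and 60.
rng = np.random.default_rng(2)
profile = rng.normal(0.0, 0.2, 100)
profile[40:60] += 0.58
print(np.round(sigma_filter(profile)[35:45], 2))  # smoothed values across the breakpoint
```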


Proceedings Article
01 Jan 2007
TL;DR: Evidence was obtained that the evolutionary history of the p63/p73 proteins has been under positive selection, and an attempt is made to relate this evidence to previous evidence for positive selection in the p53 family.
Abstract: Proteins of the related families p53 and p63/p73 are transcription factors involved in cell signaling pathways. The wide spectrum of their functions includes cell cycle arrest and apoptosis in response to DNA damage. The p53 protein also participates in the development of particular tissues during embryogenesis. Thus, it is of high importance to establish the relation between structure, function and evolution of these proteins. In the current computational study, the evolutionary mode of the p63/p73 protein family is investigated. The search for adaptive branches of the phylogenetic tree and adaptive codons in the nucleotide sequences was performed using the codeml program from the PAML package, version 3.14. The results obtained were compared with those of our previous phylogenetic analysis of the p53 protein. Evidence was obtained that the evolutionary history of the p63/p73 proteins has been under positive selection. An attempt is made to relate this evidence to the previous evidence for positive selection in the p53 family. Recently, the G245C substitution has been proposed to result in the formation of a novel Zn(2+)-binding site in the p53 protein. Molecular mechanics simulations were performed to estimate the energy of zinc binding to this site in two dominant-negative p53 mutants, G245C and R175H, in comparison with wild-type p53. The results of the estimation provided evidence that the novel Zn(2+)-binding site is functional in the G245C mutant form.

15 citations


Proceedings ArticleDOI
10 Jun 2007
TL;DR: A novel approach is presented to the detection of homological, eroded and latent periodicities in DNA sequences by assuming each symbol in a DNA sequence to be generated from an information source with an underlying probability mass function in a cyclic manner.
Abstract: A novel approach is presented to the detection of homological, eroded and latent periodicities in DNA sequences. Each symbol in a DNA sequence is assumed to be generated from an information source with an underlying probability mass function (pmf) in a cyclic manner. The number of sources can be interpreted as the periodicity of the sequence. The maximum likelihood estimates are developed for the pmfs of the information sources as well as the period of the DNA sequence. The statistical model can also be utilized for building probabilistic representations of RNA families.

12 citations
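
In the model above, position i of the sequence is a draw from source i mod p, so for a candidate period p the ML estimate of each source's pmf is simply the column-wise symbol frequency and the log-likelihood follows directly. The sketch below scores candidate periods this way; the BIC-style penalty is an assumption added for the example (comparing raw likelihoods always favours the largest p), not necessarily the criterion used in the paper:

```python
import numpy as np
from math import log

def score_period(seq: str, p: int, alphabet: str = "ACGT") -> float:
    """Penalized log-likelihood of a cyclic-source model with period p."""
    n = len(seq)
    loglik = 0.0
    for phase in range(p):
        column = seq[phase::p]                               # symbols emitted by source `phase`
        counts = np.array([column.count(a) for a in alphabet], dtype=float)
        probs = (counts + 0.5) / (counts.sum() + 2.0)        # light smoothing
        loglik += float((counts * np.log(probs)).sum())
    # BIC-style penalty: 3 free parameters per source (an assumption for this sketch).
    return loglik - 0.5 * 3 * p * log(n)

seq = "ATGATCATGATTATGATG" * 10                  # noisy period-3-like toy sequence
best = max(range(1, 12), key=lambda p: score_period(seq, p))
print("estimated period:", best)                 # prints 3 for this toy example
```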


Proceedings ArticleDOI
10 Jun 2007
TL;DR: A novel threshold logic gene regulatory model is proposed, demonstrated to be powerful enough to explain gene interaction and cellular processes and a novel programmable hardware implementation to speed up the gene network simulation is presented.
Abstract: Gene regulation is an important modeling problem in biology. The deluge of data generated by improved techniques of gene sequencing will not be of much use until we develop accurate and efficient gene regulatory network (GRN) models. In this paper a novel threshold logic gene regulatory model is proposed. This model has been demonstrated to be powerful enough to explain gene interaction and cellular processes. A novel programmable hardware implementation to speed up the gene network simulation is presented. Some insights into the extension of this model are provided.

11 citations
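
A threshold logic model updates each gene from a weighted sum of its regulators compared against a threshold. The sketch below illustrates that generic update rule on a toy three-gene feedback loop; the weights, thresholds and synchronous updating are assumptions for illustration, not the paper's specific model or its hardware mapping:

```python
import numpy as np

def step(state: np.ndarray, weights: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Synchronous threshold-logic update: gene i turns on iff sum_j w[i,j]*x[j] >= theta[i]."""
    return (weights @ state >= thresholds).astype(int)

# Toy 3-gene network: gene 0 is repressed by gene 2, gene 1 is activated by gene 0,
# gene 2 is activated by gene 1 (a negative feedback loop that oscillates).
W = np.array([[0, 0, -1],
              [1, 0,  0],
              [0, 1,  0]])
theta = np.array([0, 1, 1])

x = np.array([1, 0, 0])
for t in range(8):
    print(t, x)
    x = step(x, W, theta)
```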


Proceedings ArticleDOI
Gail L. Rosen1
10 Jun 2007
TL;DR: It is shown that AR measures can detect mutation-prone approximate matches by increasing the AR model order, and that the Euclidean distance using the binary SW mapping distinguishes perfect matches best.
Abstract: It has been shown that DNA sequences can be modeled with autoregressive (AR) processes and that the Euclidean distance between model parameters is useful for detecting sequence similarity. However, the measure's robustness to inexact, approximate matches has not been explored. We go one step further and look not only at exact gene searching, but also at how the AR distance measures are perturbed by errors and mutation. To achieve higher accuracy in similarity searching, we compare the performance of the Euclidean distance measure to the Itakura distance measure using different nucleotide mappings. The numerical mappings and distance measures have comparable performance, but in general, the Euclidean distance using the binary SW mapping distinguishes perfect matches best. Finally, we show that it is possible to use AR measures to detect mutation-prone approximate matches by increasing the AR model order.

10 citations
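
The measure in question fits an autoregressive model to a numerically mapped DNA sequence and compares two sequences by the Euclidean distance between their coefficient vectors. A minimal sketch using a least-squares AR fit and a binary S/W (strong/weak) mapping; the mapping choice and model order here are assumptions for illustration:

```python
import numpy as np

def sw_map(seq: str) -> np.ndarray:
    """Binary strong/weak mapping: G or C -> 1, A or T -> 0."""
    return np.array([1.0 if c in "GC" else 0.0 for c in seq])

def ar_coeffs(x: np.ndarray, order: int = 6) -> np.ndarray:
    """Least-squares fit of an AR(order) model x[n] ~ sum_k a[k] * x[n-k]."""
    x = x - x.mean()
    rows = [x[i:i + order][::-1] for i in range(len(x) - order)]
    A, y = np.vstack(rows), x[order:]
    a, *_ = np.linalg.lstsq(A, y, rcond=None)
    return a

def ar_distance(seq1: str, seq2: str, order: int = 6) -> float:
    """Euclidean distance between the AR coefficient vectors of two sequences."""
    return float(np.linalg.norm(ar_coeffs(sw_map(seq1), order) -
                                ar_coeffs(sw_map(seq2), order)))

gene = "ATGGCGTCCAAGGGTGAAGAACTGTTCACCGGC"
print(ar_distance(gene, gene))                         # identical sequences -> 0.0
print(ar_distance(gene, gene[:15] + "A" + gene[16:]))  # one substitution -> small distance
```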


Proceedings Article
01 Jan 2007
TL;DR: In this paper, structural and phylogenetic analyses of tRNA pairs with complementary anticodons have been performed, showing that these pairs concertedly exhibit complementary second bases in the acceptor stem, and it is argued that a double, sense-antisense, in-frame translation of the very first protein-encoding genes may have directed the code's earliest expansion.
Abstract: The updated structural and phylogenetic analyses of tRNA pairs with complementary anticodons provide independent support for our earlier finding, namely that these tRNA pairs concertedly show complementary second bases in the acceptor stem. Two implications immediately follow: first, that a tRNA molecule gained its present, complete, cloverleaf shape via duplication(s) of a shorter precursor; second, that common ancestry is shared by the two major components of the genetic code within the tRNA molecule: the classic code per se embodied in anticodon triplets, and the operational code of aminoacylation embodied primarily in the first three base pairs of the acceptor stems. In this communication we show that it might have been a double, sense-antisense, in-frame translation of the very first protein-encoding genes that directed the code's earliest expansion, thus preserving this fundamental dual-complementary link between acceptors and anticodons. Furthermore, the dual complementarity appears to be consistent with the two mirror-symmetrical modes by which class I and II aminoacyl-tRNA synthetases recognize their cognate tRNAs, from the minor and major groove sides of the acceptor stem, respectively.

Proceedings ArticleDOI
10 Jun 2007
TL;DR: The implementation of clustering and rendering of spectra as video to improve unsupervised spectral analysis is reported; spectral analysis stretches the definition of a sequence repeat by identifying shared periodicities between sequences independent of actual nucleotide composition.
Abstract: As applied to genomic sequence, spectral analysis is the application of Fourier transforms to binary indicator sequences derived from DNA sequence. After conversion to a color representation, the composition and harmonic properties of a sequence can be visualized. In previous work, these spectra have been used to identify sequence periodicities ranging from small to very large, both known and novel. Spectral analysis also stretches the definition of a sequence repeat by identifying shared periodicities between sequences independent of actual nucleotide composition. Here, we report the implementation of clustering and rendering of spectra as video to improve unsupervised spectral analysis.
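
The spectra being clustered and rendered come from Fourier transforms of binary indicator sequences computed over sliding windows. A minimal sketch of that underlying computation, producing one power spectrum per window per base channel; the window length, step size and the colour mapping to video frames are omitted or assumed:

```python
import numpy as np

def indicator(seq: str, base: str) -> np.ndarray:
    return np.array([1.0 if c == base else 0.0 for c in seq])

def sequence_spectra(seq: str, window: int = 120, step: int = 30) -> np.ndarray:
    """Sliding-window power spectra of the four binary indicator sequences.

    Returns an array of shape (n_windows, 4, window // 2) that could be mapped
    to colours and rendered frame by frame.
    """
    channels = np.stack([indicator(seq, b) for b in "ACGT"])
    frames = []
    for start in range(0, len(seq) - window + 1, step):
        seg = channels[:, start:start + window]
        seg = seg - seg.mean(axis=1, keepdims=True)          # drop the DC component
        frames.append(np.abs(np.fft.rfft(seg, axis=1))[:, 1:window // 2 + 1] ** 2)
    return np.array(frames)

rng = np.random.default_rng(3)
seq = "".join(rng.choice(list("ACGT"), 600))
print(sequence_spectra(seq).shape)   # (n_windows, 4, 60)
```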

Proceedings Article
01 Jan 2007
TL;DR: In this article, the authors analyzed the evolution of 9 genes involved in the function of the Hh signaling cascade and found that positive selection is a characteristic feature of the protein domains encoded by gene regions whose functions are related to the molecular mechanisms of development.
Abstract: Positive selection of genes that comprise signaling cascades and play a paramount role in the development of multicellular organisms is critical to our understanding of the reasons for the evolution of embryonic development. In this work, we analyze the evolution of 9 genes involved in the function of the Hh signaling cascade. We demonstrate that positive selection is a characteristic feature of the protein domains encoded by gene regions whose functions are related to the molecular mechanisms of development. We also found that the positive selection of Hh-signaling cascade transcription factors, morphogens, their development-related receptors and intracellular signal transduction factors is related to the divergence of the Bilateria taxonomic types.

Proceedings Article
01 Jan 2007
TL;DR: In this article, the authors apply a population dynamics approach, enhanced with simulation of the fate of neutrally evolving DNA sequences carried by each individual in the computer experiment, to the case of the Baikalian endemic polychaetes Manayunkia.
Abstract: In this work, we apply a population dynamics approach, enhanced with simulation of the fate of neutrally evolving "DNA sequences" carried by each individual in the computer experiment, to the case of the Baikalian endemic polychaetes Manayunkia. These animals inhabit a narrow littoral zone around the whole perimeter of the lake and are of very limited mobility. Accordingly, the general model was modified by the addition of a "geographic barrier" of varying isolating power and duration of existence. Using this model, we simulated the process of genetic differentiation of groups in this organism, taking into account isolation by distance and geographical barriers. Wright's Fst statistic was used to estimate gene flow. Relevant sample sizes were estimated that would yield the most important population parameters with the precision required to describe micro-evolutionary processes in Manayunkia.
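
Wright's Fst, used here to quantify gene flow, compares heterozygosity within subpopulations to that of the pooled population. A minimal sketch of the calculation for one biallelic locus; the equal weighting of demes is a simplifying assumption of this form, not necessarily the estimator used in the study:

```python
import numpy as np

def fst_biallelic(allele_freqs) -> float:
    """Wright's Fst = (Ht - Hs) / Ht for one biallelic locus.

    allele_freqs: frequency of one allele in each subpopulation.
    """
    p = np.asarray(allele_freqs, dtype=float)
    hs = np.mean(2 * p * (1 - p))       # mean expected heterozygosity within demes
    p_bar = p.mean()
    ht = 2 * p_bar * (1 - p_bar)        # expected heterozygosity of the pooled population
    return (ht - hs) / ht

# Two weakly differentiated demes vs. two strongly differentiated demes.
print(round(fst_biallelic([0.45, 0.55]), 3))   # small Fst (about 0.01)
print(round(fst_biallelic([0.10, 0.90]), 3))   # large Fst (about 0.64)
```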

Proceedings ArticleDOI
10 Jun 2007
TL;DR: The Stochastic Complexity (SC) whose computation is discussed for the generalized Gaussian distribution is proposed and the relationship between SC, the well-known Minimum Description Length (MDL) formula, and the Bayesian Information Criterion is investigated.
Abstract: The problem we address in this study is to decide, based on the available measurements, if a particular gene exhibits a periodic behavior. To this end we propose a principled method relying on the Stochastic Complexity (SC) whose computation is discussed for the generalized Gaussian distribution. We also investigate the relationship between SC, the well-known Minimum Description Length (MDL) formula, and the Bayesian Information Criterion (BIC). The performances of the SC-based approach are compared for simulated and real data with methods that are widely accepted in the bioinformatics community.
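
As a point of reference for the criteria compared in the paper, the BIC decision between a periodic and a flat expression profile can be written down directly: fit a sinusoid at each candidate frequency by least squares and compare penalized residual sums of squares. This sketch covers only the plain BIC variant, not the stochastic complexity computation for the generalized Gaussian developed in the paper:

```python
import numpy as np

def bic_periodic_vs_flat(y: np.ndarray, n_freqs: int = 20) -> tuple[float, float]:
    """Return (BIC of the best sinusoidal model, BIC of the constant model)."""
    y = np.asarray(y, dtype=float)
    n, t = len(y), np.arange(len(y))

    def bic(residuals: np.ndarray, k: int) -> float:
        rss = float(residuals @ residuals)
        return n * np.log(rss / n) + k * np.log(n)   # Gaussian BIC up to a constant

    best = np.inf
    for f in np.linspace(0.5 / n, 0.5, n_freqs, endpoint=False):
        X = np.column_stack([np.ones(n), np.cos(2 * np.pi * f * t),
                             np.sin(2 * np.pi * f * t)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        best = min(best, bic(y - X @ beta, k=3))
    return best, bic(y - y.mean(), k=1)

t = np.arange(36)
expr = 0.8 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(4).normal(0, 0.4, 36)
print(bic_periodic_vs_flat(expr))   # the periodic BIC should be the smaller of the two
```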



Proceedings ArticleDOI
10 Jun 2007
TL;DR: A more sophisticated fuzzy clustering based method is introduced and it is shown that possibilistic c-means clustering performed the best among several fuzzy clustering approaches and that a new unbiased statistic is able to quantify the gene expression level more accurately.
Abstract: Despite the widespread application of microarray imaging for biomedical research, barriers still exist regarding its reliability and reproducibility for clinical use. A critical problem lies in accurate spot segmentation and quantification of gene expression level (mRNA) from microarray images. A variety of commercial and research freeware packages are available, but most cannot handle array spots with complex shapes such as donuts and scratches. Clustering approaches such as k-means and mixture models, which use hard labeling of each pixel, were introduced to overcome this difficulty. In this paper, we introduce a more sophisticated fuzzy clustering based method. We show that possibilistic c-means clustering performed the best among several fuzzy clustering approaches. In addition, we compared three statistical criteria for measuring gene expression levels and show that a new unbiased statistic is able to quantify the gene expression level more accurately. The proposed algorithms have been tested on a variety of simulated and real microarray images, demonstrating their better performance.
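
For reference, the fuzzy c-means iteration that the possibilistic variant builds on alternates between membership and centroid updates. The sketch below segments toy spot pixel intensities into foreground and background with standard fuzzy c-means; this is the baseline fuzzy method, not the possibilistic c-means the paper finds best, and the parameters are assumed values:

```python
import numpy as np

def fuzzy_cmeans_1d(values: np.ndarray, c: int = 2, m: float = 2.0,
                    n_iter: int = 50) -> tuple[np.ndarray, np.ndarray]:
    """Standard fuzzy c-means on scalar pixel intensities.

    Returns (memberships of shape (n, c), cluster centres of shape (c,)).
    """
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(0)
    u = rng.dirichlet(np.ones(c), size=len(x))          # random initial memberships
    for _ in range(n_iter):
        w = u ** m
        centres = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        d = np.abs(x[:, None] - centres[None, :]) + 1e-12
        inv = 1.0 / d ** (2.0 / (m - 1))
        u = inv / inv.sum(axis=1, keepdims=True)        # standard FCM membership update
    return u, centres

# Toy spot: dim background pixels mixed with bright foreground pixels.
rng = np.random.default_rng(5)
pixels = np.concatenate([rng.normal(200, 20, 300), rng.normal(1500, 150, 120)])
u, centres = fuzzy_cmeans_1d(pixels)
foreground = pixels[u[:, np.argmax(centres)] > 0.5]
print(len(foreground), centres.round(0))
```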

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A new clustering method based on Linear Predictive Coding is presented to provide enhanced microarray data analysis; it provides improved clustering accuracy compared to some conventional clustering techniques.
Abstract: Microarrays are powerful tools for simultaneous monitoring of the expression levels of a large number of genes. Their analysis is usually achieved by using clustering techniques. In this paper, we present a new clustering method based on Linear Predictive Coding to provide enhanced microarray data analysis. In this approach, spectral analysis of microarray data is performed to classify samples according to their distortion values. The technique was validated on a standard data set. Comparative analysis of the results indicates that this method provides improved clustering accuracy compared to some conventional clustering techniques. Moreover, our classifier does not require any prior training procedure.
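
Linear Predictive Coding summarizes a signal by the coefficients of a short linear predictor, obtained from the autocorrelation sequence via the Levinson-Durbin recursion. The sketch below computes LPC coefficients for simulated profiles and compares them by Euclidean distance; the model order and the use of a plain coefficient distance (rather than the paper's distortion measure) are assumptions for illustration:

```python
import numpy as np

def lpc(x: np.ndarray, order: int = 8) -> np.ndarray:
    """LPC coefficients via the Levinson-Durbin recursion on the autocorrelation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

# Two groups of simulated profiles with different dominant periods.
rng = np.random.default_rng(6)
t = np.arange(64)
group_a = [np.sin(2 * np.pi * t / 8) + rng.normal(0, 0.3, 64) for _ in range(5)]
group_b = [np.sin(2 * np.pi * t / 20) + rng.normal(0, 0.3, 64) for _ in range(5)]
coeffs = np.array([lpc(p) for p in group_a + group_b])
d = np.linalg.norm(coeffs[:, None, :] - coeffs[None, :, :], axis=2)
print(np.round(d[0, :], 2))   # distances from sample 0: smaller within group A than to group B
```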

Proceedings ArticleDOI
10 Jun 2007
TL;DR: Two extensions of an ordered list comparison measure are described, recently proposed for comparing Internet search engines: the use of random permutations to assess the significance of differences between ordered lists, and a graphical extension that highlights the items responsible for the main differences between two lists.
Abstract: There is growing interest in using rank-ordered gene lists to avoid excessive dependence on measured gene expression levels, which can vary strongly across experiments, platforms, or analysis methods. As a useful tool for working with these lists, this paper describes two extensions of an ordered list comparison measure, recently proposed for comparing Internet search engines: the use of random permutations to assess the significance of differences between ordered lists, and a graphical extension that highlights the items responsible for the main differences between two lists. The method is illustrated for a prostate cancer example from the genomics literature.
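
The permutation extension can be illustrated with any similarity measure between two ranked gene lists: compute the observed value, then recompute it over many random reorderings of one list to obtain a null distribution and an empirical p-value. The average top-k overlap used below is a stand-in measure, not the specific search-engine comparison measure the paper extends:

```python
import numpy as np

def avg_topk_overlap(list1: list[str], list2: list[str], kmax: int = 20) -> float:
    """Average size of the intersection of the top-k items, for k = 1..kmax."""
    return float(np.mean([len(set(list1[:k]) & set(list2[:k]))
                          for k in range(1, kmax + 1)]))

def permutation_pvalue(list1: list[str], list2: list[str], kmax: int = 20,
                       n_perm: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Empirical p-value of the observed overlap against random orderings of list2."""
    rng = np.random.default_rng(seed)
    observed = avg_topk_overlap(list1, list2, kmax)
    null = [avg_topk_overlap(list1, list(rng.permutation(list2)), kmax)
            for _ in range(n_perm)]
    p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, p

genes = [f"g{i}" for i in range(500)]
ranking_a = genes[:]                       # reference ranking
ranking_b = genes[:30][::-1] + genes[30:]  # similar ranking: top 30 reversed
print(permutation_pvalue(ranking_a, ranking_b))   # high overlap, small p-value
```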

Proceedings ArticleDOI
10 Jun 2007
TL;DR: This paper considers the problem of finding global states incoming to a specified global state in a Boolean network, and shows that this problem is NP-hard in general and presents algorithms that are much faster than the naive exhaustive search-based algorithm.
Abstract: This paper considers the problem of finding global states incoming to a specified global state in a Boolean network, which may be useful for pre-processing of finding a sequence of control actions for a Boolean network and for identifying the basin of attraction for a given attractor. We show that this problem is NP-hard in general along with related theoretical results. On the other hand, we present algorithms that are much faster than the naive exhaustive search-based algorithm.
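
The naive baseline that the paper's algorithms improve upon enumerates all 2^n global states and keeps those whose synchronous update maps to the target state. A minimal sketch of that exhaustive predecessor search for a small toy network:

```python
from itertools import product

def predecessors(update_funcs, target):
    """Exhaustively find all global states that map to `target` in one synchronous step.

    update_funcs: one function per node, each mapping the full state tuple to that
    node's next value.  Runs in O(2^n), so it is only practical for small networks.
    """
    n = len(target)
    preds = []
    for state in product((0, 1), repeat=n):
        if tuple(f(state) for f in update_funcs) == tuple(target):
            preds.append(state)
    return preds

# Toy 3-node Boolean network: x0' = x1 AND x2, x1' = NOT x0, x2' = x0 OR x1.
rules = [lambda s: s[1] & s[2],
         lambda s: 1 - s[0],
         lambda s: s[0] | s[1]]
print(predecessors(rules, (0, 1, 1)))   # -> [(0, 1, 0)]
```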

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A mathematical model of EGF receptor signaling in PCRV cells was constructed based on experimental data, and its validity was confirmed with time-course Western blot data, showing that the model could successfully reproduce all experimental results.
Abstract: The epidermal growth factor receptor (EGFR) and its signaling pathways are closely involved in the development and progression of prostate cancer. To understand the EGF-stimulated signaling pathways in prostate cancer cells, we constructed a mathematical model of EGF receptor signaling in PCRV cells based on the experimental data. The model is described by twenty-seven ordinary differential equations, and all parameters were estimated by a genetic algorithm. The validity of the model was confirmed with time-course Western blot data, showing that the model could successfully reproduce all experimental results.
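
The overall fitting strategy, a system of ODEs whose rate constants are tuned by an evolutionary search against time-course data, can be illustrated on a much smaller scale. The two-state model and the simple mutation-and-selection loop below are stand-ins for the paper's twenty-seven-equation model and its genetic algorithm; all rate laws and parameter ranges are assumptions:

```python
import numpy as np

def simulate(k_act: float, k_deact: float, t_end: float = 60.0, dt: float = 0.1) -> np.ndarray:
    """Toy two-state model: ligand-bound receptor R activates effector E (Euler integration)."""
    r, e = 1.0, 0.0
    traj = []
    for _ in range(int(t_end / dt)):
        dr = -0.05 * r                                  # receptor internalization/decay
        de = k_act * r * (1 - e) - k_deact * e          # effector activation/deactivation
        r, e = r + dt * dr, e + dt * de
        traj.append(e)
    return np.array(traj)

def fit_by_evolution(observed: np.ndarray, n_gen: int = 40, pop: int = 30,
                     seed: int = 0) -> np.ndarray:
    """Minimal evolutionary search over (k_act, k_deact) minimizing squared error."""
    rng = np.random.default_rng(seed)
    population = rng.uniform(0.01, 1.0, size=(pop, 2))
    for _ in range(n_gen):
        errors = [np.sum((simulate(*p) - observed) ** 2) for p in population]
        elite = population[np.argsort(errors)[:pop // 4]]                    # keep the best quarter
        children = elite[rng.integers(len(elite), size=pop - len(elite))]
        children = np.abs(children + rng.normal(0, 0.05, children.shape))    # mutate
        population = np.vstack([elite, children])
    final_errors = [np.sum((simulate(*p) - observed) ** 2) for p in population]
    return population[int(np.argmin(final_errors))]

target = simulate(0.3, 0.1) + np.random.default_rng(7).normal(0, 0.01, 600)
print(fit_by_evolution(target).round(2))   # should land roughly near (0.3, 0.1)
```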

Proceedings ArticleDOI
10 Jun 2007
TL;DR: This paper combines the compressor ProtComp, previously designed only for amino acid sequences, with a dictionary-based method, where the dictionary containing the patterns for representing the secondary structure is obtained by suitably processing the Dictionary of Protein Secondary Structure database.
Abstract: In this paper we study the problem of jointly encoding the amino acid sequence and the secondary structure information of proteins, in the current format in which more and more proteins are stored in the Swiss-Prot database. The new method, dubbed ProtCompSecS, combines the compressor ProtComp, previously designed only for amino acid sequences, with a dictionary-based method, where the dictionary containing the patterns for representing the secondary structure is obtained by suitably processing the Dictionary of Protein Secondary Structure (DSSP) database. We experimented with protein sequences of 14 complete proteomes. When comparing the performance of the ProtCompSecS algorithm with that of the ProtComp algorithm for those sequences that have annotated secondary structure information, it surprisingly appeared that encoding both the sequence and the secondary structure information is more efficient than encoding the protein sequence alone (without knowledge of the secondary structure). This is a strong argument for claiming that the secondary structure has a high descriptive value for modeling and understanding the primary structure (the amino acid sequence) of a protein.

Proceedings Article
01 Jan 2007
TL;DR: This new model is based on the concept of resonant-like interaction between RNA polymerase and hairpins of RNA secondary structure formed during transcription, and expresses the probability of termination vs. the concentration of charged aminoacyl-tRNA or of the amino acid itself.
Abstract: The RNAmodel web server was recently established at the IITP RAS to implement our previously proposed model of the classic attenuation regulation of gene expression in bacteria. This new model is based on the concept of resonant-like interaction between RNA polymerase and hairpins of RNA secondary structure formed during transcription. Our modeling relies on standard Monte Carlo procedures and covers all essential stages of the process, including initiation and elongation of transcription and translation; the deceleration of the ribosome on regulatory codons, which depends on the concentration of charged aminoacyl-tRNA; the polymerase shifting delay caused by secondary structure folded into the mRNA segment between ribosome and polymerase; and, ultimately, either transcription terminating prematurely or polymerase reaching the region of structural genes (antitermination). By means of Monte Carlo simulation we build a function p(c) which expresses the probability of termination (i.e., an enzyme activity) vs. the concentration of charged aminoacyl-tRNA or of the amino acid itself, measured in actual or relative units.

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A probabilistic model of the kinetics of hybridization is developed and a procedure for the estimation of its parameters which include the binding rate and target concentration is described, which is an important step towards developing optimal detection algorithms for the microarrays which measure the kinetic process.
Abstract: Conventional fluorescent-based microarrays acquire data after the hybridization phase. In this phase the target analytes (i.e., DNA fragments) bind to the capturing probes on the array and supposedly reach a steady state. Accordingly, microarray experiments essentially provide only a single, steady-state data point of the hybridization process. On the other hand, a novel technique (i.e., real-time microarrays) capable of recording the kinetics of hybridization in fluorescent-based microarrays has recently been proposed in [5]. The richness of the information obtained therein promises higher signal-to-noise ratio, smaller estimation error, and broader assay detection dynamic range compared to conventional microarrays. In the current paper, we develop a probabilistic model of the kinetics of hybridization and describe a procedure for the estimation of its parameters, which include the binding rate and target concentration. This probabilistic model is an important step towards developing optimal detection algorithms for microarrays which measure the kinetics of hybridization, and towards understanding their fundamental limitations.
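
A common first-order picture of hybridization kinetics has the captured-target signal rise as x(t) = c(1 - exp(-kt)), with k reflecting the binding rate and c the available target amount; fitting both from a measured time course illustrates the kind of parameter estimation the paper treats probabilistically. The model form and the least-squares fit below are simplifying assumptions, not the authors' probabilistic model:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(t, c, k):
    """First-order binding: the signal approaches c with rate constant k."""
    return c * (1.0 - np.exp(-k * t))

# Simulated real-time readout of one probe spot.
rng = np.random.default_rng(8)
t = np.linspace(0, 30, 60)                       # minutes
signal = saturation(t, 850.0, 0.12) + rng.normal(0, 15.0, t.size)

(c_hat, k_hat), _ = curve_fit(saturation, t, signal, p0=[500.0, 0.05])
print(round(c_hat, 1), round(k_hat, 3))          # close to the simulated 850 and 0.12
```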

Journal ArticleDOI
30 Dec 2007
TL;DR: This paper presents how page segmentation is done in OCR systems by discussing the XY Cut page segmentation algorithm; the result of XY Cut page segmentation is evaluated on a scanned OCR document.
Abstract: Page segmentation is an important step in analysing patterns from OCR systems. In this paper we present how page segmentation is done in OCR systems. We discuss the XY Cut page segmentation algorithm. The result of XY Cut page segmentation is evaluated on a scanned OCR document.
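
The XY Cut algorithm recursively splits a binary page image at wide white gaps found in its horizontal and vertical pixel projection profiles until no sufficiently wide gap remains. A minimal sketch that always cuts at the widest gap; the gap threshold and the toy page are assumed values:

```python
import numpy as np

def longest_zero_run(profile: np.ndarray) -> tuple[int, int]:
    """Return (start, length) of the longest run of zeros in a 1-D profile."""
    best_start, best_len, start = 0, 0, None
    for i, v in enumerate(np.append(profile, 1)):     # sentinel closes a trailing run
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len

def xy_cut(block: np.ndarray, min_gap: int = 8) -> list[np.ndarray]:
    """Recursive XY cut: split a binary image (1 = ink) at the widest white gap
    in its row or column projection profile; recurse until no gap is wide enough."""
    cuts = []
    for axis in (0, 1):                               # 0: cut along rows, 1: along columns
        profile = block.sum(axis=1 - axis)
        start, length = longest_zero_run(profile)
        cuts.append((length, axis, start + length // 2))
    length, axis, pos = max(cuts)
    if length < min_gap:
        return [block]                                # leaf block: no wide gap left
    first, second = np.split(block, [pos], axis=axis)
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)

# Toy "page": two text blocks separated by a wide horizontal white band.
page = np.zeros((60, 40), dtype=int)
page[5:20, 5:35] = 1
page[40:55, 5:35] = 1
print([b.shape for b in xy_cut(page) if b.sum() > 0])   # two ink-containing blocks
```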

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A novel approach for the integration of expression quantitative trait loci (eQTL) data with protein-protein interaction (PPI) data is presented, which facilitates eQTL as a rich data source for the unbiased inference of regulatory pathways.
Abstract: The automated inference or prediction of protein-protein interaction networks from large-scale measurements and other genomic data has become a standard technique in systems biology. However, typically these networks only represent undirected interactions between proteins, without classifying the type and directionality of interactions. Regulatory interactions transmit signals and are activating or repressing. As a step towards more detailed understanding of such regulatory networks, we present a novel approach for the integration of expression quantitative trait loci (eQTL) data with protein-protein interaction (PPI) data. Application of this approach to a new yeast interaction network with 3,491 proteins and 16,438 interactions (covering PPI and transcriptional interactions) allows us to infer the directionality of interactions and also to identify pathways that regulate the expression of individual genes. Inferred pathways contain chains of PPI as well as transcription factor - DNA interactions. We discuss the regulation of the DNA damage related transcription factor Rpn4p as an example. This new approach facilitates eQTL as a rich data source for the unbiased inference of regulatory pathways.

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A new joint clustering and encoding algorithm is proposed which selects the cluster centers by minimizing the overall code length when encoding the cluster centers (block prototypes) and the blocks conditional on the block prototypes, resulting in a more compact conditional description.
Abstract: This paper proposes a new method for finding block structure in haplotypes. The new method belongs to the family of minimum description length (MDL) methods, which have been intensively investigated in connection with this problem in the past. Within the MDL paradigm we evaluate the code length by using the normalized maximum likelihood (NML) model, as opposed to the two-part codes used in the past, resulting in a more compact conditional description. We also propose a new joint clustering and encoding algorithm which selects the cluster centers by minimizing the overall code length when encoding the cluster centers (block prototypes) and the blocks conditional on the block prototypes. The minimized description length provided by the new algorithm is shown to be smaller than that obtained by previous methods when applied to real haplotype data. The inference of the block boundaries using this better code-length measure produces different results than the previous methods, significantly reducing the description length of the overall haplotype partition.
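
For context, the normalized maximum likelihood code length that replaces the older two-part codes assigns a sequence x^n the length below, where the second term (the parametric complexity) is the logarithm of a normalizing sum over all sequences of the same length; this is the generic NML definition, not the paper's specific haplotype-block instantiation.

$$
L_{\mathrm{NML}}(x^n) \;=\; -\log P_{\hat{\theta}(x^n)}(x^n) \;+\; \log \sum_{y^n} P_{\hat{\theta}(y^n)}(y^n)
$$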

Proceedings ArticleDOI
10 Jun 2007
TL;DR: A universal algorithm for creating compressed archives with instantaneous access and decodability of designated functional elements is introduced and a special-purpose variant is also given to enhance performance for DNA sequences.
Abstract: This article introduces a universal algorithm for creating compressed archives with instantaneous access and decodability of designated functional elements. A special-purpose variant is also given to enhance performance for DNA sequences. The resulting algorithm, integrated into an earlier scheme, achieves a marked improvement in randomly accessible coding of annotated genome files, while completely retaining the functionality of instantaneous retrieval of all feature entries.

Proceedings Article
01 Apr 2007