
Showing papers presented at "Bioinformatics and Bioengineering" in 2005


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel tracker is proposed to capture the human breathing signal through an infrared imaging method; built on mean shift localization-based particle filtering, it can handle significant head movement and object occlusion.
Abstract: In this paper, we propose a novel tracker to capture the human breathing signal through an infrared imaging method. Human facial physiology information is used to select salient thermal features on the human face as good features to track. The major component of the tracker is mean shift localization (MSL)-based particle filtering. A special measurement model is designed for particle filtering so that the tracker can handle significant head movement and object occlusion. The breathing signal is recovered from the tracking results. The experiments show that the tracker is robust and stable and that the recovered breathing signal is clear enough for breathing functionality computation.
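The mean shift localization step at the heart of such a tracker can be sketched in isolation: a search window is repeatedly moved to the weighted centroid of the measurements it covers until it stops shifting. This is only an illustrative sketch (the `weights` map, window radius, and convergence threshold are assumptions), not the paper's full MSL-based particle filter:

```python
def mean_shift(weights, start, radius=2, iters=20):
    """Move a window to the weighted centroid of the samples it covers,
    repeating until convergence (mean shift localization sketch).
    `weights` maps (x, y) positions to confidence values."""
    x, y = start
    for _ in range(iters):
        num_x = num_y = total = 0.0
        for (px, py), w in weights.items():
            if abs(px - x) <= radius and abs(py - y) <= radius:
                num_x += w * px
                num_y += w * py
                total += w
        if total == 0:          # window covers no samples; give up
            break
        nx, ny = num_x / total, num_y / total
        if abs(nx - x) < 1e-6 and abs(ny - y) < 1e-6:
            break               # converged: shift is negligible
        x, y = nx, ny
    return x, y
```

Starting near a cluster of high-weight thermal features, the window settles on their centroid while ignoring distant outliers.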

61 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: The methodology presented here is based on the automatic detection of abnormal WCE patterns, in an effort to reduce the reading time of WCE images and the cost of the procedure.
Abstract: This paper presents a methodology for detecting abnormal patterns in wireless capsule endoscopy (WCE) images. In particular, an average of 50,000 images are obtained during a WCE exam. Usually, these images are reviewed in the form of a video at speeds between 5 and 40 image-frames/sec. The time spent by a physician reading the results of WCE images varies between 45 and 180 minutes. This presents a major problem: the reading process consumes a significant amount of time, and the results take several days to become available, since the physician has to find the time to study each video uninterrupted for up to 3 hours. The methodology presented here is based on the automatic detection of abnormal WCE patterns, in an effort to reduce the reading time of the WCE images and the cost of the procedure. The methodology consists of a synergistic integration of image processing, analysis, and recognition techniques for achieving the automatic detection of the WCE abnormal patterns.

52 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold, and achieves an average fold recognition of 74%, significantly higher than the 56-60% previously reported in the literature.
Abstract: The classification of proteins based on their structure can play an important role in the deduction or discovery of protein function. However, the relatively low number of solved protein structures and the unknown relationship between structure and sequence require an alternative method of representation for classification to be effective. Furthermore, the large number of potential folds causes problems for many classification strategies, increasing the likelihood that the classifier will reach a local optimum while trying to distinguish between all of the possible structural categories. Here we present a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold. Using a well-known dataset derived from the 27 most-populated SCOP folds and several sequence-based descriptor properties as input features, we test a number of classification methods, including Naive Bayes and Boosted C4.5. Our strategy achieves an average fold recognition of 74%, which is significantly higher than the 56-60% previously reported in the literature, indicating the effectiveness of a multi-level approach.
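The multi-level strategy can be illustrated with a toy two-stage predictor: assign a SCOP class first, then choose a fold only among the folds of that class. Nearest-centroid lookup stands in here for the paper's Naive Bayes and Boosted C4.5 classifiers, and all labels and feature values are illustrative:

```python
def nearest_label(x, centroids):
    # Euclidean nearest-centroid lookup; centroids maps label -> feature tuple.
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def hierarchical_predict(x, class_centroids, fold_centroids):
    """Two-stage prediction: pick a SCOP class, then restrict the fold
    search to folds belonging to that class only."""
    cls = nearest_label(x, class_centroids)
    return cls, nearest_label(x, fold_centroids[cls])
```

Restricting the second stage to one class's folds is what shrinks the candidate set and avoids forcing one classifier to separate all 27 folds at once.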

33 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel definition of the neighborhood of a node is presented, together with an effective algorithm for finding network motifs that seeks a neighborhood assignment for each node such that the induced neighborhoods are partitioned with no overlap.
Abstract: Network motifs have been demonstrated to be the building blocks in many biological networks such as transcriptional regulatory networks. Finding network motifs plays a key role in understanding system level functions and design principles of molecular interactions. In this paper, we present a novel definition of the neighborhood of a node. Based on this concept, we formally define and present an effective algorithm for finding network motifs. The method seeks a neighborhood assignment for each node such that the induced neighborhoods are partitioned with no overlap. We then present a parallel algorithm to find network motifs using a parallel cluster. The algorithm is applied on an E. coli transcriptional regulatory network to find motifs with size up to six. Compared with previous algorithms, our algorithm performs better in terms of running time and precision. Based on the motifs that are found in the network, we further analyze the topology and coverage of the motifs. The results suggest that a small number of key motifs can form the motifs of a bigger size. Also, some motifs exhibit a correlation with complex functions. This study presents a framework for detecting the most significant recurring subgraph patterns in transcriptional regulatory networks.
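As a minimal illustration of motif counting (not the paper's neighborhood-based algorithm), the following counts feed-forward loops, the best-known 3-node motif in transcriptional regulatory networks:

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count feed-forward loops: ordered node triples (x, y, z) with
    edges x->y, x->z, and y->z all present in the directed graph."""
    E = set(edges)
    nodes = {u for e in edges for u in e}
    return sum(1 for x, y, z in permutations(nodes, 3)
               if (x, y) in E and (x, z) in E and (y, z) in E)
```

Real motif finders compare such counts against randomized networks to decide significance; this brute-force enumeration is only practical for small graphs.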

32 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: Novel preprocessing techniques, based on topological measures of the network, are presented to identify clusters of proteins from protein-protein interaction (PPI) networks, wherein each cluster corresponds to a group of functionally similar proteins.
Abstract: In this article we present novel preprocessing techniques, based on topological measures of the network, to identify clusters of proteins from protein-protein interaction (PPI) networks, wherein each cluster corresponds to a group of functionally similar proteins. The two main problems with analyzing protein-protein interaction networks are their scale-free property and the large number of false positive interactions that they contain. Our preprocessing techniques use a key transformation and separate weighting functions to effectively eliminate suspect edges, potential false positives, from the graph. A useful side-effect of this transformation is that the resulting graph is no longer scale-free. We then examine the application of two well-known clustering techniques, namely hierarchical and multilevel graph partitioning, on the reduced network. We define suitable statistical metrics to evaluate our clusters meaningfully. From our study, we discover that the application of clustering on the pre-processed network results in significantly improved, biologically relevant, and balanced clusters when compared with clusters derived from the original network. We strongly believe that our strategies will prove invaluable to future studies on the prediction of protein functionality from PPI networks.

28 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: The presented methodology is based on a neural net model available from AIIS Inc.; the results, tested at the AT research lab, are promising in comparison with those produced by the given imaging RBIS.
Abstract: This paper deals with the development of a methodology for detecting bleeding in WCE images. The presented methodology is based on a neural net model available from AIIS Inc., and the results of this method were tested at the AT research lab. The performance of our method offers promising results in comparison with those produced by the given imaging RBIS (red blood identification system), which vary from 21% to 41%. Our method is under improvement, and the expected results will reach near 80% or more.

27 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents the novel granular support vector machines recursive feature elimination (GSVM-RFE) algorithm for the gene selection task, which can separately eliminate irrelevant, redundant or noisy genes in different granules at different stages and can select positively related genes and negatively related genes in balance.
Abstract: Selecting the most informative cancer-related genes from huge microarray gene expression data is an important and challenging bioinformatics research topic. This paper presents the novel granular support vector machines recursive feature elimination (GSVM-RFE) algorithm for the gene selection task. As a biologically meaningful hybrid method of statistical learning theory and granular computing theory, GSVM-RFE can separately eliminate irrelevant, redundant, or noisy genes in different granules at different stages and can select positively related genes and negatively related genes in balance. Simulation results on the prostate cancer dataset show that GSVM-RFE is statistically much more accurate than traditional algorithms for prostate cancer classification. More importantly, GSVM-RFE extracts a compact "perfect" gene subset of 17 genes with 100% accuracy. To the best of our knowledge, this is the first time such a "perfect" gene subset has been reported, which is expected to be helpful for prostate cancer study.
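The recursive feature elimination idea underlying (G)SVM-RFE can be sketched as follows. A simple class-mean-difference score stands in for the SVM weight vector, and the granular decomposition is omitted, so this is an assumption-laden sketch of the RFE loop only:

```python
def rfe(X, y, n_keep):
    """Recursive feature elimination sketch: each round ranks the
    surviving features by a relevance score and drops the weakest.
    X is a list of sample rows, y a list of 0/1 class labels."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        def score(j):
            # Stand-in relevance score: absolute difference of class means.
            pos = [row[j] for row, lab in zip(X, y) if lab == 1]
            neg = [row[j] for row, lab in zip(X, y) if lab == 0]
            return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
        features.remove(min(features, key=score))
    return features
```

In true SVM-RFE the score is the feature's weight in a retrained linear SVM, so the ranking is recomputed after every elimination, exactly as the loop above does with its stand-in score.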

26 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This work provides a quadratic integer program (QIP) formulation for the HPP problem, and describes an algorithm based on a semi-definite programming (SDP) relaxation of that QIP program that is capable of incorporating a variety of additional constraints.
Abstract: Diploid organisms, such as humans, inherit one copy of each chromosome (haplotype) from each parent. The conflation of inherited haplotypes is called the genotype of the organism. In many disease association studies, the haplotype data is more informative than the genotype data. Unfortunately, obtaining haplotype data experimentally is both expensive and difficult. The haplotype inference with pure parsimony (HPP) problem is the problem of finding a minimal set of haplotypes that resolve a given set of genotypes. We provide a quadratic integer program (QIP) formulation for the HPP problem, and describe an algorithm for the HPP problem based on a semi-definite programming (SDP) relaxation of that QIP program. We compare our approach with existing approaches. Further, we show that the proposed approach is capable of incorporating a variety of additional constraints, such as missing or erroneous genotype data, outliers, etc.

23 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A 2D vibration array for detecting dynamic changes in 3D space during navigation is presented; it conveys these changes in real time to visually impaired users (in the form of vibration) so that they can develop a 3D sense of the space, assisting their navigation in their working and living environment.
Abstract: This paper presents a 2D vibration array for detecting dynamic changes in 3D space during navigation, providing these changes in real time to visually impaired users (in the form of vibration) in order to develop a 3D sensing of the space and assist their navigation in their working and living environment. This vibration array is part of the Tyflos prototype device (consisting of two tiny cameras, a microphone, and an ear speaker mounted into a pair of dark glasses and connected to a portable PC) for blind individuals. The overall idea of detecting changes in 3D space is based on fusing range data and image data captured by the cameras and creating a 3D representation of the surrounding space. This 3D representation of the space and its changes are mapped onto a 2D vibration array placed on the chest of the blind user. The degree of vibration offers a sensing of the 3D space and its changes to the user.

22 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: Another input parameter is eliminated from EMP: the minimum number of binding sites in the DNA sequences; this reduces the burden on the user and may give more realistic/robust results.
Abstract: Finding common patterns, or motifs, in a set of DNA sequences is an important problem in molecular biology. Most motif-discovering algorithms/software require the length of the motif as input. Motivated by the fact that the motif's length is usually unknown in practice, Styczynski et al. introduced the extended (l,d)-motif problem (EMP), where the motif's length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Fewer input parameters not only reduce the burden on the user, but also may give more realistic/robust results, since restrictions on the length or on the number of binding sites make little sense when the best motif may be neither the longest nor have the largest number of binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical.

20 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel characterization of the building blocks of repeats, called elementary repeats, is given, leading to a natural definition of repeat boundaries; efficient algorithms are designed and tested on synthetic and real biological data, and shown to be highly accurate.
Abstract: The accurate identification of repeats remains a challenging open problem in bioinformatics. Most existing methods of repeat identification either depend on annotated repeat databases or restrict repeats to pairs of similar sequences that are maximal in length. The fundamental flaw in most of the available methods is the lack of a definition that correctly balances the importance of the length and the frequency. In this paper, we propose a new definition of repeats that satisfies both criteria. We give a novel characterization of the building blocks of repeats, called elementary repeats, which leads to a natural definition of repeat boundaries. We design efficient algorithms and test them on synthetic and real biological data. Experimental results show that our method is highly accurate.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: An add-on algorithm is proposed that can be applied to any single-gene-based discriminative score, integrating domain knowledge from gene ontology annotation into the gene selection process to alleviate the high false discovery rate problem.
Abstract: Selecting informative genes from microarray experiments is one of the most important data analysis steps for deciphering the biological information embedded in such experiments. However, due to the characteristics of microarray technology and the underlying biology, namely a large number of genes and a limited number of samples, the statistical soundness of gene selection algorithms becomes questionable. One major problem is the high false discovery rate. A microarray experiment is only one facet of current knowledge of the biological system under study. In this paper, we propose to alleviate this high false discovery rate problem by integrating domain knowledge into the gene selection process. Gene ontology represents a controlled biological vocabulary and a repository of computable biological knowledge. It is shown in the literature that gene ontology-based similarities between genes carry significant information about their functional relationships. Integration of such domain knowledge into gene selection algorithms enables us to remove noisy genes intelligently. We propose an add-on algorithm that can be applied to any single-gene-based discriminative score, integrating domain knowledge from gene ontology annotation. Preliminary experiments are performed on a publicly available colon cancer dataset to demonstrate the utility of integrating domain knowledge for the purpose of gene selection. Our experiments show interesting results.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: MPME is proposed, a string encoding of gene adjacency relationships whose optimal internal node assignments can be determined globally in polynomial time, to provide better initializations for GRAPPA and yields shorter tree lengths and better accuracy in phylogeny reconstruction.
Abstract: The study of genome rearrangements, the evolutionary events that change the order and strandedness of genes within genomes, presents new opportunities for discoveries about deep evolutionary events. The best software so far, GRAPPA, solves breakpoint and inversion phylogenies by scoring each tree topology through iterative improvements of internal node gene orders. We find that the greedy hill-climbing approach limits accuracy because of multiple local optima. To address this problem, we propose integrating GRAPPA with MPME, a string encoding of gene adjacency relationships whose optimal internal node assignments can be determined globally in polynomial time, to provide better initializations for GRAPPA. In simulation studies, the new algorithm yields shorter tree lengths and better accuracy in phylogeny reconstruction.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A framework of multi-scale clustering, where clustering is done with multiple scale values and then the obtained results are compiled into a visually appropriate form to observe overall structures of the clusters is discussed.
Abstract: In cluster analyses, setting the scale parameter, which is implicitly related to the complexity of the data distribution, is an important issue; different scale values lead to different results and hence cause different interpretations. In this study, we discuss a framework of multi-scale clustering, where clustering is done with multiple scale values and the obtained results are then compiled into a visually appropriate form to observe the overall structure of the clusters. For this purpose, a brick view method is proposed in this study. The construction of a brick view diagram consists of a reindexing procedure for the clusters obtained with various scale values and a sorting procedure for the samples, so as to minimize a distortion defined on the multiple clustering results. Although some popular clustering methods, such as K-means, spherical K-means, and hierarchical clustering, have been used within the multi-scale framework, we introduce mean-shift clustering based on kernel density estimation for directional data. We evaluate our approach and existing hierarchical clustering using an artificial data set and a real data set of gene expression profiles. The results show that global structures of the distributions can be observed well, and in a stable manner, in the brick view diagram.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper derives six distinct correlation functions based on explicit thermodynamic modeling of gene regulatory networks and combines these correlation functions with novel biclustering algorithms to identify functionally enriched groups.
Abstract: The attempt to elucidate biological pathways and classify genes has led to the development of numerous clustering approaches to gene expression. All these approaches use a single metric to identify genes with similar expression levels. Until now, the correlation between the expression levels of such genes has been based on phenomenological and heuristic correlation functions, rather than on biological models. In this paper, we derive six distinct correlation functions based on explicit thermodynamic modeling of gene regulatory networks. We then combine these correlation functions with novel biclustering algorithms to identify functionally enriched groups. The statistical significance of the identified groups is demonstrated by precision-recall curves and calculated p-values. Furthermore, comparison with chromatin immunoprecipitation data indicates that the performance of the derived correlation functions depends on the specific regulatory mechanisms. Finally, we introduce the idea of multi-class biclustering and with the help of support vector machines we demonstrate its improved classification performance in a microarray dataset.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A combinatorial fusion analysis technique is used to facilitate feature selection and combination for improving predictive accuracy in protein structure classification and has an overall prediction accuracy rate of 87% for four classes and 69.6% for 27 folding categories.
Abstract: The classification of protein structures is essential for their function determination in bioinformatics. The success of the protein structure classification depends on two factors: the computational methods used and the features selected. In this paper, we use a combinatorial fusion analysis technique to facilitate feature selection and combination for improving predictive accuracy in protein structure classification. When applying these criteria to our previous work, the resulting classification has an overall prediction accuracy rate of 87% for four classes and 69.6% for 27 folding categories. These rates are significantly higher than our previous work and demonstrate that combinatorial fusion is a valuable method for protein structure classification.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A lightweight index structure, namely, the succeeding unit array (the SUA) is designed based on pattern unit, which decreases the space consumption efficiently and solves the space bottleneck in the search of repetitions.
Abstract: This paper proposes a new concept of repetitions, the largest pattern repetition (LPR), and the concept of a pattern unit. A lightweight index structure, namely the succeeding unit array (SUA), is designed based on pattern units. The SUA decreases space consumption efficiently and resolves the space bottleneck in the search for repetitions. On the SUA, all the atomic patterns that constitute the LPRs can be detected, and the LPRs can be identified by connecting the same patterns. Theoretical analysis and experimental results show that both the space and time complexity of the algorithms are O(n).

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A probabilistic model to define the similarity based on conditional probabilities is proposed and a two-step method for estimating the similarity between two proteins based on protein interaction profile is proposed.
Abstract: High-throughput methods for detecting protein-protein interactions (PPI) have given researchers an initial global picture of protein interactions on a genomic scale. The huge data sets generated by such experiments pose new challenges in data analysis. Though clustering methods have been successfully applied in many areas of bioinformatics, many clustering algorithms cannot be readily applied to protein interaction data sets. One main problem is that the similarity between two proteins cannot be easily defined. This paper proposes a probabilistic model to define this similarity based on conditional probabilities. We then propose a two-step method for estimating the similarity between two proteins based on their protein interaction profiles. In the first step, the model is trained with proteins with known annotations. Based on this model, similarities are calculated in the second step. Experiments show that our method improves performance.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: The experimental results show that the probability for the correct tree to be among a very small number of trees constructed using the method is very high, which opens a new research direction to further investigate more efficient algorithms for phylogenetic inference.
Abstract: In this paper we introduce a new quartet-based method. This method makes use of the Bayes (or quartet) weights of quartets as used in quartet puzzling. However, all the weights from the related quartets are accumulated to form a global quartet weight matrix. This matrix provides integrated information and can lead us to recursively merge small sub-trees into larger ones until the final single tree is obtained. The experimental results show that the probability for the correct tree to be among a very small number of trees constructed using our method is very high. These significant results open a new research direction to further investigate more efficient algorithms for phylogenetic inference.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A parallel algorithm for the constrained multiple sequence alignment (CMSA) problem that seeks an optimal multiple alignment constrained to include a given pattern and is faster in general than the existing sequential dynamic programming solutions.
Abstract: We propose a parallel algorithm for the constrained multiple sequence alignment (CMSA) problem that seeks an optimal multiple alignment constrained to include a given pattern. We consider the dynamic programming computations in layers indexed by the symbols of the given pattern. In each layer we compute as a potential part of an optimal alignment for the CMSA problem, shortest paths for multiple sources and multiple destinations. These shortest paths problems are independent from one another (which enables parallel execution), and each can be solved using an A* algorithm specialized for the shortest paths problem for multiple sources and multiple destinations. The final step of our algorithm solves a single source single destination shortest path problem. Our experiments on real sequences show that our algorithm is faster in general than the existing sequential dynamic programming solutions.

Proceedings Article•DOI•
Ssu-Hua Huang1, Ru-Sheng Liu1, Chien-Yu Chen1, Ya-Ting Chao1, Shu-Yuan Chen1 •
19 Oct 2005
TL;DR: This paper proposes a method for discriminating outer membrane proteins from other proteins by support vector machines using combinations of gapped amino acid pair compositions that outperforms the OM classifier of PSORTb v.2.0 and a method based on dipeptide composition.
Abstract: Discriminating outer membrane proteins from proteins with other subcellular localizations and from other folding classes is important for further predicting their functions and structures. In this paper, we propose a method for discriminating outer membrane proteins from other proteins by support vector machines using combinations of gapped amino acid pair compositions. Using 5-fold cross-validation, the method achieves 95% precision and 92% recall on a dataset of proteins with well-annotated subcellular localizations, consisting of 471 outer membrane proteins and 1,120 other proteins. When applied to another dataset of 377 outer membrane proteins and 674 globular proteins belonging to four typical structural classes, the method reaches 96% precision and recall and correctly excludes 98% of the globular proteins. Our method outperforms the OM classifier of PSORTb v.2.0 and a method based on dipeptide composition.
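The feature construction can be sketched directly: for each gap g, count ordered amino acid pairs separated by g residues and normalize. Concatenating such vectors over several gaps would form the SVM input; the exact gaps and normalization used in the paper are not specified here, so this is an assumption-laden sketch:

```python
def gapped_pair_composition(seq, gap):
    """Frequency of ordered residue pairs (seq[i], seq[i + gap + 1]);
    gap=0 reduces to the usual dipeptide composition."""
    counts = {}
    n = len(seq) - gap - 1          # number of pairs at this gap
    for i in range(n):
        pair = seq[i] + seq[i + gap + 1]
        counts[pair] = counts.get(pair, 0) + 1
    return {p: c / n for p, c in counts.items()}
```

A full feature vector would fix an ordering over the 400 possible pairs and fill in zeros for absent ones before feeding the concatenated gap vectors to the SVM.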

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel model-free and stable gene selection method is proposed in this paper, which does not assume any statistical model on the gene expression data and it is not affected by the unbalanced samples.
Abstract: Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Detecting the most significantly differentially expressed genes under different conditions, or gene selection, has been a central focus for researchers. The gene selection problem becomes more difficult when the numbers of samples under different conditions vary significantly, or are unbalanced. A novel model-free and stable gene selection method is proposed in this paper, i.e., the method does not assume any statistical model on the gene expression data and it is not affected by the unbalanced samples. The method has been evaluated on two publicly available datasets, the leukemia dataset and the small round blue cell tumor dataset, where the experimental results showed that the proposed method is efficient and robust in identifying differentially expressed genes.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A Distance-based Sequence Indexing Method (DSIM) for indexing and searching genome databases is proposed, borrowing the idea of video compression, which achieves significantly faster response time than BLAST, while maintaining comparable accuracy.
Abstract: In this paper, we propose a Distance-based Sequence Indexing Method (DSIM) for indexing and searching genome databases. Borrowing an idea from video compression, we compress the genomic sequence database around a set of automatically selected reference words, formed from high-frequency data substrings and substrings from past queries. The compression captures the distance of each non-reference word in the database to some reference word. At runtime, a query is processed by comparing its substrings with the compressed data strings, through their distances to the reference words. We also propose an efficient scheme to incrementally update the reference words and the compressed data sequences as more data sequences are added and new queries come along. Extensive experiments on a human genome database with 2.62 GB of DNA sequence letters show that the new algorithm achieves significantly faster response time than BLAST, while maintaining comparable accuracy.
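The distance-based filtering idea can be illustrated with a triangle-inequality pruning step: if each data word is stored only as its distance to its nearest reference word, a word can be within distance t of the query only if its stored distance is within t of the query's distance to the same reference word. This is a sketch in the spirit of DSIM, not its actual index layout:

```python
def hamming(a, b):
    # Hamming distance over equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def dsim_candidates(query, words, refs, t):
    """Keep only words that could lie within distance t of the query,
    using the triangle inequality through each word's nearest reference:
    |d(query, ref) - d(word, ref)| <= d(query, word)."""
    index = [(w, min(refs, key=lambda r: hamming(w, r))) for w in words]
    return [w for w, r in index
            if abs(hamming(query, r) - hamming(w, r)) <= t]
```

Only the surviving candidates would then be compared against the query directly, which is where the speedup over exhaustive scanning comes from.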

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This investigation identifies phylogenetic motifs using heuristic maximum parsimony trees and shows that when using parsimony the functional site prediction accuracy of PMs improves substantially, particularly on divergent datasets.
Abstract: We have recently demonstrated (La et al., Proteins, 58, 2005) that sequence fragments approximating the overall familial phylogeny, called phylogenetic motifs (PMs), represent a promising protein functional site prediction strategy. Previous results across a structurally and functionally diverse dataset indicate that phylogenetic motifs correspond to a wide variety of known functional characteristics. Phylogenetic motifs are detected using a sliding window algorithm that compares neighbor joining trees on the complete alignment to those on the sequence fragments. In this investigation we identify PMs using heuristic maximum parsimony trees. We show that when using parsimony the functional site prediction accuracy of PMs improves substantially, particularly on divergent datasets. We also show that the new PMs found using parsimony are not necessarily conserved in sequence and, therefore, would not be detected by traditional motif (information content-based) approaches.

Proceedings Article•DOI•
Chin-Tang Hsieh1, Guang-Lin Hsieh1, E. Lai1, Zong-Ting Hsieh1, Guo-Ming Hong1 •
19 Oct 2005
TL;DR: The MSP is used to implement an equiripple-design finite impulse response (FIR) filter, integrating a ring buffer for the input samples with the symmetric structure of the FIR filter to compute the convolution efficiently.
Abstract: A low-power, portable, and easily implemented Holter recorder is necessary for patients and for researchers of electrocardiography (ECG). Such a Holter recorder, built from off-the-shelf components, is realized with a mixed signal processor (MSP) in this paper. To decrease the complexity of the analog circuits and the interference of 60 Hz noise from the power line, we use the MSP to implement an equiripple-design finite impulse response (FIR) filter. We also integrate a ring buffer for the input samples with the symmetric structure of the FIR filter to compute the convolution efficiently. The experimental results show that the PQRST features of the output ECG signal are easy to distinguish. This ECG signal is recorded for 24 hr using an SD card. Furthermore, the ECG signal is transmitted with a smartphone via Bluetooth to decrease the burden on the Holter recorder.
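The symmetry trick can be sketched in a few lines: a linear-phase (equiripple) FIR filter has taps with h[k] = h[N-1-k], so samples equidistant from both ends of the delay line can be added before a single multiply, roughly halving the multiplications. A ring buffer holds the last N samples. This is an illustrative sketch, not the MSP implementation:

```python
class SymmetricFIR:
    """Streaming FIR filter exploiting tap symmetry h[k] == h[N-1-k]."""
    def __init__(self, taps):
        assert taps == taps[::-1], "linear-phase FIR taps are symmetric"
        self.taps = taps
        self.buf = [0.0] * len(taps)   # ring buffer of recent samples
        self.pos = 0                    # next write position

    def step(self, x):
        n = len(self.taps)
        self.buf[self.pos] = x
        self.pos = (self.pos + 1) % n
        # Reorder the ring buffer newest-to-oldest for clarity.
        ordered = [self.buf[(self.pos - 1 - k) % n] for k in range(n)]
        acc = 0.0
        for k in range(n // 2):
            # One multiply covers two symmetric taps.
            acc += self.taps[k] * (ordered[k] + ordered[n - 1 - k])
        if n % 2:
            acc += self.taps[n // 2] * ordered[n // 2]
        return acc
```

Feeding a unit impulse through the filter reproduces the tap values, the usual sanity check for an FIR implementation.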

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into the process of seed generation.
Abstract: We propose a method for finding seeds for the local alignment of two nucleotide sequences. Our method uses randomized algorithms to find approximate seeds. We present a dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into the process of seed generation. Experimental results show that our method produces better quality seeds with improved running time and memory usage compared to traditional non-spaced and spaced seeds. The presented algorithm scales well to longer seeds while maintaining quality and performance.
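The basic machinery behind fingerprint-indexed seeding can be sketched with a Rabin-Karp style rolling hash: fingerprint every k-gram of one sequence into an index, then probe it with the k-grams of the other. The sketch below is a simplification under stated assumptions; the paper's dynamic index, randomized HSA algorithm, and approximate (non-exact) seeds are more involved, and the `sample_rate` thinning here is only a loose nod to its randomization.

```python
import random

def kgram_fingerprints(seq, k, base=4, mod=(1 << 31) - 1):
    """Rolling (Rabin-Karp style) fingerprints for every k-gram of seq."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    high = pow(base, k - 1, mod)
    fps, h = [], 0
    for i, ch in enumerate(seq):
        if i >= k:
            h = (h - code[seq[i - k]] * high) % mod  # drop the oldest symbol
        h = (h * base + code[ch]) % mod
        if i >= k - 1:
            fps.append(h)
    return fps  # fps[j] is the fingerprint of seq[j:j+k]

def find_seeds(s1, s2, k, sample_rate=1.0, rng=None):
    """Exact k-gram seed matches between s1 and s2 via a fingerprint index.

    sample_rate < 1.0 randomly thins the index, trading sensitivity for
    memory; an illustrative stand-in for randomized seed generation.
    """
    rng = rng or random.Random(0)
    index = {}
    for j, fp in enumerate(kgram_fingerprints(s1, k)):
        if rng.random() <= sample_rate:
            index.setdefault(fp, []).append(j)
    seeds = []
    for j2, fp in enumerate(kgram_fingerprints(s2, k)):
        for j1 in index.get(fp, ()):
            if s1[j1:j1 + k] == s2[j2:j2 + k]:  # guard against hash collisions
                seeds.append((j1, j2))
    return seeds
```

Each seed is a pair of start positions sharing an identical k-gram, which an aligner would then extend into a local alignment.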

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A new RNA pseudoknot prediction method based on term rewriting rather than on dynamic programming, comparative sequence analysis, or context-free grammars is presented, indicating that term rewriting has a broad potential in RNA applications.
Abstract: RNA plays a critical role in mediating every step of cellular information transfer from genes to functional proteins. Pseudoknots are widely occurring structural motifs found in all types of RNA and are also functionally important. Therefore, predicting their structures is an important problem. In this paper, we present a new RNA pseudoknot prediction method based on term rewriting rather than on dynamic programming, comparative sequence analysis, or context-free grammars. The method we describe is implemented using the Mfold RNA/DNA folding package and the term rewriting language Maude. Our method was tested on 211 pseudoknots in PseudoBase and achieves an average accuracy of 74.085% compared to the experimentally determined structure. In fact, most pseudoknots discovered by our method achieve an accuracy of above 90%. These results indicate that term rewriting has a broad potential in RNA applications, from prediction of pseudoknots to higher level RNA structures involving complex RNA tertiary interactions.
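To make the term-rewriting paradigm concrete, the toy engine below repeatedly applies rules (lhs → rhs) until a normal form is reached. It is a string-rewriting stand-in for illustration only: the actual work rewrites structured RNA terms in Maude, with rules far richer than these. In the example, a rule such as "GLC" → "L" absorbs a G–C pair that closes a loop token `L`, so a string that reduces all the way to `L` represents a well-formed stem-loop.

```python
def rewrite(term, rules, max_steps=1000):
    """Apply string-rewriting rules (lhs, rhs) until no rule matches.

    A toy stand-in for Maude-style term rewriting; the paper's rules
    operate on RNA structure terms, not plain strings.
    """
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in term:
                term = term.replace(lhs, rhs, 1)  # rewrite leftmost match
                break
        else:
            return term  # normal form: no rule applies
    raise RuntimeError("no normal form within step limit")

# Illustrative rules: a Watson-Crick pair flanking a loop is absorbed.
STEM_RULES = [("GLC", "L"), ("CLG", "L"), ("ALU", "L"), ("ULA", "L")]
```

Here "GGLCC" (two G–C pairs around a loop) reduces to "L", while a string with a mismatched flank gets stuck short of "L".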

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel normalization approach is presented that couples concurrent identification of invariantly expressed genes (IEGs) with nonlinear regression normalization, achieving low expression variance across replicates and excellent fold-change preservation.
Abstract: Normalization is an important prerequisite for almost all follow-up microarray data analysis steps. Accurate normalization assures a common base for comparative biomedical studies using gene expression profiles across different experiments and phenotypes. In this paper, we present a novel normalization approach, the iterative nonlinear regression (INR) method, that couples concurrent identification of invariantly expressed genes (IEGs) with nonlinear regression normalization. We demonstrate the principle and performance of the INR approach on two real microarray data sets. Compared to major peer methods (e.g., the linear regression, Loess, and iterative ranking methods), the INR method shows superior performance in achieving low expression variance across replicates and excellent fold-change preservation.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents a method which uses contrast analysis on the frequency of sequences to identify delimiters for optional fields and help complete the layout descriptions of biological datasets.
Abstract: One of the major problems in biological data integration is that many data sources are stored as atlasses, with a variety of different layouts. Integrating data from such sources can be an extremely time-consuming task. We have been developing data mining techniques to help learn the layout of a dataset in a semi-automatic way. In this paper, we focus on the problem of identifying delimiters for optional fields. Since these fields do not occur in every record, frequency-based methods are not able to identify the corresponding delimiters. We present a method which uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three biological datasets.
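The intuition behind contrast analysis on delimiter frequencies can be sketched simply: a mandatory field's delimiter appears in (nearly) every record, while an optional field's delimiter appears in only a stable minority, so the contrast in per-record occurrence rates separates the two. The sketch below is an illustrative simplification: the candidate-generation rule (tokens ending in a colon) and the frequency thresholds are assumptions, and the paper's actual contrast analysis over character sequences is more general.

```python
from collections import Counter

def optional_delimiters(records, min_frac=0.1, max_frac=0.9):
    """Flag candidate delimiters for optional fields (illustrative).

    Strings occurring in almost all records are treated as mandatory
    delimiters; strings occurring in a minority of records, but often
    enough not to be noise, are flagged as optional-field delimiters.
    """
    counts = Counter()
    for rec in records:
        # candidate delimiters here: whitespace tokens ending in ':'
        counts.update({tok for tok in rec.split() if tok.endswith(":")})
    n = len(records)
    return sorted(tok for tok, c in counts.items()
                  if min_frac <= c / n <= max_frac)
```

In a toy dataset where every record has `ID:` and `SEQ:` but only some have `NOTE:`, pure frequency ranking would treat `NOTE:` as noise, while the occurrence-rate contrast flags it as an optional-field delimiter.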

Proceedings Article•DOI•
19 Oct 2005
TL;DR: By using a wrapper induction system for the creation and maintenance of wrappers, the scalability, flexibility, and stability of the integrated information system are easily maintained.
Abstract: Integrating life science Web databases, while important and necessary, is a challenge for current integration systems, mainly due to the large number of these databases, their heterogeneity, and the fact that their interfaces may change often. BACIIS, a biological and chemical information integration system, is a tightly coupled federated database system that uses the mediator-wrapper method to retrieve information from several remote Web databases. BACIIS relies on a semi-automated approach for generating and maintaining wrappers in order to provide a scalable system with limited maintenance overhead. The semi-automatic wrapper induction in BACIIS is efficient because it is based on, but not limited to, domain knowledge. Tests show that the use of an ontology increases the accuracy of the wrapper induction. We also present how the wrapper induction system facilitates wrapper updates and assists in information extraction. By using a wrapper induction system for the creation and maintenance of wrappers, the scalability, flexibility, and stability of the integrated information system are easily maintained.
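At its simplest, a wrapper of the kind such systems induce is a pair of landmarks around each field of interest in a result page. The sketch below builds such a landmark wrapper by hand to show what the induced artifact looks like; BACIIS learns the landmarks semi-automatically with ontology guidance, and the page fragment, landmark strings, and function names here are all hypothetical.

```python
import re

def make_wrapper(prefix, suffix):
    """Build a simple landmark-based wrapper: extract the text between a
    prefix landmark and a suffix landmark, for every occurrence on a page.
    Wrapper induction systems learn such landmarks from example pages."""
    pat = re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.S)
    return lambda page: [m.strip() for m in pat.findall(page)]

# hypothetical fragment of a Web database result page
page = "<b>Gene:</b> BRCA1<br><b>Gene:</b> TP53<br>"
extract_gene = make_wrapper("<b>Gene:</b>", "<br>")
```

When the remote site changes its layout, only the landmark strings need relearning, which is what makes semi-automatic induction attractive for maintenance.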