
Showing papers presented at "Bioinformatics and Bioengineering" in 2005


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel tracker is proposed to capture the human breathing signal through an infrared imaging method; built on mean shift localization-based particle filtering, it can handle significant head movement and object occlusion.
Abstract: In this paper, we propose a novel tracker to capture the human breathing signal through an infrared imaging method. Human facial physiology information is used to select salient thermal features on the human face as good features to track. The major component of the tracker is mean shift localization (MSL)-based particle filtering. A special measurement model is designed for particle filtering so that the tracker can handle significant head movement and object occlusion. The breathing signal is recovered from the tracking results. The experiments show that the tracker is robust and stable and that the recovered breathing signal is clear enough for breathing functionality computation.
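The mean shift localization step at the heart of such a tracker can be sketched in isolation: a search window is repeatedly moved to the weighted centroid of the measurements it covers until it stops shifting. This is only an illustrative sketch (the `weights` map, window radius, and convergence threshold are assumptions), not the paper's full MSL-based particle filter:

```python
def mean_shift(weights, start, radius=2, iters=20):
    """Move a window to the weighted centroid of the samples it covers,
    repeating until convergence (mean shift localization sketch).
    `weights` maps (x, y) positions to confidence values."""
    x, y = start
    for _ in range(iters):
        num_x = num_y = total = 0.0
        for (px, py), w in weights.items():
            if abs(px - x) <= radius and abs(py - y) <= radius:
                num_x += w * px
                num_y += w * py
                total += w
        if total == 0:          # window covers no samples; give up
            break
        nx, ny = num_x / total, num_y / total
        if abs(nx - x) < 1e-6 and abs(ny - y) < 1e-6:
            break               # converged: shift is negligible
        x, y = nx, ny
    return x, y
```

Starting near a cluster of high-weight thermal features, the window settles on their centroid while ignoring distant outliers.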

61 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: The methodology presented here is based on the automatic detection of abnormal WCE patterns, in an effort to reduce the reading time of WCE images and the cost of the procedure.
Abstract: This paper presents a methodology for detecting abnormal patterns in wireless capsule endoscopy (WCE) images. In particular, an average of 50,000 images are obtained during a WCE exam. Usually, these images are reviewed in the form of a video at speeds between 5 and 40 image-frames/sec. The time spent by a physician reading the results of WCE images varies between 45 and 180 minutes. This presents a major problem: the reading process consumes a significant amount of time, and the results take several days to become available, since the physician has to find the time to study each video uninterrupted for up to 3 hours. The methodology presented here is based on the automatic detection of abnormal WCE patterns, in an effort to reduce the reading time of the WCE images and the cost of the procedure. The methodology consists of a synergistic integration of image processing, analysis, and recognition techniques for achieving the automatic detection of the WCE abnormal patterns.

52 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold, and achieves an average fold recognition of 74%, significantly higher than the 56-60% previously reported in the literature.
Abstract: The classification of proteins based on their structure can play an important role in the deduction or discovery of protein function. However, the relatively low number of solved protein structures and the unknown relationship between structure and sequence require an alternative method of representation for classification to be effective. Furthermore, the large number of potential folds causes problems for many classification strategies, increasing the likelihood that the classifier will reach a local optimum while trying to distinguish between all of the possible structural categories. Here we present a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold. Using a well-known dataset derived from the 27 most-populated SCOP folds and several sequence-based descriptor properties as input features, we test a number of classification methods, including Naive Bayes and Boosted C4.5. Our strategy achieves an average fold recognition of 74%, which is significantly higher than the 56-60% previously reported in the literature, indicating the effectiveness of a multi-level approach.
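The multi-level strategy can be illustrated with a toy two-stage predictor: assign a SCOP class first, then choose a fold only among the folds of that class. Nearest-centroid lookup stands in here for the paper's Naive Bayes and Boosted C4.5 classifiers, and all labels and feature values are illustrative:

```python
def nearest_label(x, centroids):
    # Euclidean nearest-centroid lookup; centroids maps label -> feature tuple.
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def hierarchical_predict(x, class_centroids, fold_centroids):
    """Two-stage prediction: pick a SCOP class, then restrict the fold
    search to folds belonging to that class only."""
    cls = nearest_label(x, class_centroids)
    return cls, nearest_label(x, fold_centroids[cls])
```

Restricting the second stage to one class's folds is what shrinks the candidate set and avoids forcing one classifier to separate all 27 folds at once.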

33 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel definition of the neighborhood of a node is presented, together with an effective algorithm for finding network motifs that seeks a neighborhood assignment for each node such that the induced neighborhoods are partitioned with no overlap.
Abstract: Network motifs have been demonstrated to be the building blocks in many biological networks such as transcriptional regulatory networks. Finding network motifs plays a key role in understanding system level functions and design principles of molecular interactions. In this paper, we present a novel definition of the neighborhood of a node. Based on this concept, we formally define and present an effective algorithm for finding network motifs. The method seeks a neighborhood assignment for each node such that the induced neighborhoods are partitioned with no overlap. We then present a parallel algorithm to find network motifs using a parallel cluster. The algorithm is applied on an E. coli transcriptional regulatory network to find motifs with size up to six. Compared with previous algorithms, our algorithm performs better in terms of running time and precision. Based on the motifs that are found in the network, we further analyze the topology and coverage of the motifs. The results suggest that a small number of key motifs can form the motifs of a bigger size. Also, some motifs exhibit a correlation with complex functions. This study presents a framework for detecting the most significant recurring subgraph patterns in transcriptional regulatory networks.
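As a minimal illustration of motif counting (not the paper's neighborhood-based algorithm), the following counts feed-forward loops, the best-known 3-node motif in transcriptional regulatory networks:

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count feed-forward loops: ordered node triples (x, y, z) with
    edges x->y, x->z, and y->z all present in the directed graph."""
    E = set(edges)
    nodes = {u for e in edges for u in e}
    return sum(1 for x, y, z in permutations(nodes, 3)
               if (x, y) in E and (x, z) in E and (y, z) in E)
```

Real motif finders compare such counts against randomized networks to decide significance; this brute-force enumeration is only practical for small graphs.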

32 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: Novel preprocessing techniques, based on topological measures of the network, are presented to identify clusters of proteins from protein-protein interaction (PPI) networks, wherein each cluster corresponds to a group of functionally similar proteins.
Abstract: In this article we present novel preprocessing techniques, based on topological measures of the network, to identify clusters of proteins from protein-protein interaction (PPI) networks, wherein each cluster corresponds to a group of functionally similar proteins. The two main problems with analyzing protein-protein interaction networks are their scale-free property and the large number of false positive interactions that they contain. Our preprocessing techniques use a key transformation and separate weighting functions to effectively eliminate suspect edges, potential false positives, from the graph. A useful side-effect of this transformation is that the resulting graph is no longer scale-free. We then examine the application of two well-known clustering techniques, namely hierarchical and multilevel graph partitioning, on the reduced network. We define suitable statistical metrics to evaluate our clusters meaningfully. From our study, we discover that the application of clustering on the pre-processed network results in significantly improved, biologically relevant, and balanced clusters when compared with clusters derived from the original network. We strongly believe that our strategies will prove invaluable to future studies on the prediction of protein functionality from PPI networks.

28 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: The presented methodology is based on a neural net model available from AIIS Inc.; the results, tested at the AT research lab, are promising in comparison with those produced by the given imaging RBIS.
Abstract: This paper deals with the development of a methodology for detecting bleeding in WCE images. The presented methodology is based on a neural net model available from AIIS Inc., and the results of this method were tested at the AT research lab. The performance of our method offers promising results in comparison with those produced by the given imaging RBIS (red blood identification system), which vary from 21% to 41%. Our method is under improvement, and the expected results will reach near 80% or more.

27 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents the novel granular support vector machines recursive feature elimination (GSVM-RFE) algorithm for the gene selection task, which can separately eliminate irrelevant, redundant or noisy genes in different granules at different stages and can select positively related genes and negatively related genes in balance.
Abstract: Selecting the most informative cancer-related genes from huge microarray gene expression data is an important and challenging bioinformatics research topic. This paper presents the novel granular support vector machines recursive feature elimination (GSVM-RFE) algorithm for the gene selection task. As a biologically meaningful hybrid method of statistical learning theory and granular computing theory, GSVM-RFE can separately eliminate irrelevant, redundant, or noisy genes in different granules at different stages and can select positively related genes and negatively related genes in balance. Simulation results on the prostate cancer dataset show that GSVM-RFE is statistically much more accurate than traditional algorithms for prostate cancer classification. More importantly, GSVM-RFE extracts a compact "perfect" gene subset of 17 genes with 100% accuracy. To the best of our knowledge, this is the first time such a "perfect" gene subset has been reported, which is expected to be helpful for prostate cancer study.
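The recursive feature elimination idea underlying (G)SVM-RFE can be sketched as follows. A simple class-mean-difference score stands in for the SVM weight vector, and the granular decomposition is omitted, so this is an assumption-laden sketch of the RFE loop only:

```python
def rfe(X, y, n_keep):
    """Recursive feature elimination sketch: each round ranks the
    surviving features by a relevance score and drops the weakest.
    X is a list of sample rows, y a list of 0/1 class labels."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        def score(j):
            # Stand-in relevance score: absolute difference of class means.
            pos = [row[j] for row, lab in zip(X, y) if lab == 1]
            neg = [row[j] for row, lab in zip(X, y) if lab == 0]
            return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
        features.remove(min(features, key=score))
    return features
```

In true SVM-RFE the score is the feature's weight in a retrained linear SVM, so the ranking is recomputed after every elimination, exactly as the loop above does with its stand-in score.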

26 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: This work provides a quadratic integer program (QIP) formulation for the HPP problem, and describes an algorithm based on a semi-definite programming (SDP) relaxation of that QIP program that is capable of incorporating a variety of additional constraints.
Abstract: Diploid organisms, such as humans, inherit one copy of each chromosome (haplotype) from each parent. The conflation of inherited haplotypes is called the genotype of the organism. In many disease association studies, the haplotype data is more informative than the genotype data. Unfortunately, obtaining haplotype data experimentally is both expensive and difficult. The haplotype inference with pure parsimony (HPP) problem is the problem of finding a minimal set of haplotypes that resolve a given set of genotypes. We provide a quadratic integer program (QIP) formulation for the HPP problem, and describe an algorithm for the HPP problem based on a semi-definite programming (SDP) relaxation of that QIP program. We compare our approach with existing approaches. Further, we show that the proposed approach is capable of incorporating a variety of additional constraints, such as missing or erroneous genotype data, outliers, etc.

23 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A 2D vibration array for detecting dynamic changes in 3D space during navigation is presented; it conveys these changes in real time to visually impaired users (in the form of vibration) so that they can develop a 3D sense of the space, assisting their navigation in their working and living environment.
Abstract: This paper presents a 2D vibration array for detecting dynamic changes in 3D space during navigation, providing these changes in real time to visually impaired users (in the form of vibration) in order to develop a 3D sensing of the space and assist their navigation in their working and living environment. This vibration array is part of the Tyflos prototype device (consisting of two tiny cameras, a microphone, and an ear speaker mounted into a pair of dark glasses and connected to a portable PC) for blind individuals. The overall idea of detecting changes in 3D space is based on fusing range data and image data captured by the cameras and creating a 3D representation of the surrounding space. This 3D representation of the space and its changes are mapped onto a 2D vibration array placed on the chest of the blind user. The degree of vibration offers a sensing of the 3D space and its changes to the user.

22 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: Another input parameter is eliminated from EMP: the minimum number of binding sites in the DNA sequences; this reduces the burden on the user and may give more realistic/robust results.
Abstract: Finding common patterns, or motifs, in a set of DNA sequences is an important problem in molecular biology. Most motif-discovering algorithms/software require the length of the motif as input. Motivated by the fact that the motif's length is usually unknown in practice, Styczynski et al. introduced the extended (l,d)-motif problem (EMP), where the motif's length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Fewer input parameters not only reduce the burden on the user, but also may give more realistic/robust results, since restrictions on the length or on the number of binding sites make little sense when the best motif may be neither the longest nor have the largest number of binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical.

20 citations


Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel characterization of the building blocks of repeats, called elementary repeats, is given, leading to a natural definition of repeat boundaries; efficient algorithms are designed and tested on synthetic and real biological data, and shown to be highly accurate.
Abstract: The accurate identification of repeats remains a challenging open problem in bioinformatics. Most existing methods of repeat identification either depend on annotated repeat databases or restrict repeats to pairs of similar sequences that are maximal in length. The fundamental flaw in most of the available methods is the lack of a definition that correctly balances the importance of the length and the frequency. In this paper, we propose a new definition of repeats that satisfies both criteria. We give a novel characterization of the building blocks of repeats, called elementary repeats, which leads to a natural definition of repeat boundaries. We design efficient algorithms and test them on synthetic and real biological data. Experimental results show that our method is highly accurate.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: An add-on algorithm is proposed that can be applied to any single-gene-based discriminative score, integrating domain knowledge from gene ontology annotation into the gene selection process to alleviate the high false discovery rate problem.
Abstract: Selecting informative genes from microarray experiments is one of the most important data analysis steps for deciphering the biological information embedded in such experiments. However, due to the characteristics of microarray technology and the underlying biology, namely a large number of genes and a limited number of samples, the statistical soundness of gene selection algorithms becomes questionable. One major problem is the high false discovery rate. A microarray experiment is only one facet of current knowledge of the biological system under study. In this paper, we propose to alleviate this high false discovery rate problem by integrating domain knowledge into the gene selection process. Gene ontology represents a controlled biological vocabulary and a repository of computable biological knowledge. It is shown in the literature that gene ontology-based similarities between genes carry significant information about their functional relationships. Integration of such domain knowledge into gene selection algorithms enables us to remove noisy genes intelligently. We propose an add-on algorithm that can be applied to any single-gene-based discriminative score, integrating domain knowledge from gene ontology annotation. Preliminary experiments are performed on a publicly available colon cancer dataset to demonstrate the utility of integrating domain knowledge for the purpose of gene selection. Our experiments show interesting results.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: MPME is proposed, a string encoding of gene adjacency relationships whose optimal internal node assignments can be determined globally in polynomial time, to provide better initializations for GRAPPA and yields shorter tree lengths and better accuracy in phylogeny reconstruction.
Abstract: The study of genome rearrangements, the evolutionary events that change the order and strandedness of genes within genomes, presents new opportunities for discoveries about deep evolutionary events. The best software so far, GRAPPA, solves breakpoint and inversion phylogenies by scoring each tree topology through iterative improvements of internal node gene orders. We find that the greedy hill-climbing approach limits accuracy because of multiple local optima. To address this problem, we propose integrating GRAPPA with MPME, a string encoding of gene adjacency relationships whose optimal internal node assignments can be determined globally in polynomial time, to provide better initializations for GRAPPA. In simulation studies, the new algorithm yields shorter tree lengths and better accuracy in phylogeny reconstruction.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A framework of multi-scale clustering, where clustering is done with multiple scale values and then the obtained results are compiled into a visually appropriate form to observe overall structures of the clusters is discussed.
Abstract: In cluster analyses, setting the scale parameter, which is implicitly related to the complexity of the data distribution, is an important issue; different scale values lead to different results and hence cause different interpretations. In this study, we discuss a framework of multi-scale clustering, where clustering is done with multiple scale values and the obtained results are then compiled into a visually appropriate form to observe the overall structure of the clusters. For this purpose, a brick view method is proposed in this study. The construction of a brick view diagram consists of a reindexing procedure for the clusters obtained with various scale values and a sorting procedure for the samples, so as to minimize a distortion defined on the multiple clustering results. Although some popular clustering methods, such as K-means, spherical K-means, and hierarchical clustering, have been used within the multi-scale framework, we introduce mean-shift clustering based on kernel density estimation for directional data. We evaluate our approach and existing hierarchical clustering using an artificial data set and a real data set of gene expression profiles. The results show that global structures of the distributions can be observed well, and in a stable manner, in the brick view diagram.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper derives six distinct correlation functions based on explicit thermodynamic modeling of gene regulatory networks and combines these correlation functions with novel biclustering algorithms to identify functionally enriched groups.
Abstract: The attempt to elucidate biological pathways and classify genes has led to the development of numerous clustering approaches to gene expression. All these approaches use a single metric to identify genes with similar expression levels. Until now, the correlation between the expression levels of such genes has been based on phenomenological and heuristic correlation functions, rather than on biological models. In this paper, we derive six distinct correlation functions based on explicit thermodynamic modeling of gene regulatory networks. We then combine these correlation functions with novel biclustering algorithms to identify functionally enriched groups. The statistical significance of the identified groups is demonstrated by precision-recall curves and calculated p-values. Furthermore, comparison with chromatin immunoprecipitation data indicates that the performance of the derived correlation functions depends on the specific regulatory mechanisms. Finally, we introduce the idea of multi-class biclustering and with the help of support vector machines we demonstrate its improved classification performance in a microarray dataset.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A combinatorial fusion analysis technique is used to facilitate feature selection and combination for improving predictive accuracy in protein structure classification and has an overall prediction accuracy rate of 87% for four classes and 69.6% for 27 folding categories.
Abstract: The classification of protein structures is essential for their function determination in bioinformatics. The success of the protein structure classification depends on two factors: the computational methods used and the features selected. In this paper, we use a combinatorial fusion analysis technique to facilitate feature selection and combination for improving predictive accuracy in protein structure classification. When applying these criteria to our previous work, the resulting classification has an overall prediction accuracy rate of 87% for four classes and 69.6% for 27 folding categories. These rates are significantly higher than our previous work and demonstrate that combinatorial fusion is a valuable method for protein structure classification.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A lightweight index structure, namely, the succeeding unit array (the SUA) is designed based on pattern unit, which decreases the space consumption efficiently and solves the space bottleneck in the search of repetitions.
Abstract: This paper proposes a new concept of repetitions, the largest pattern repetition (LPR), and the concept of a pattern unit. A lightweight index structure, namely the succeeding unit array (SUA), is designed based on pattern units. The SUA decreases space consumption efficiently and resolves the space bottleneck in the search for repetitions. On the SUA, all the atomic patterns that constitute the LPRs can be detected, and the LPRs can be identified by connecting the same patterns. Theoretical analysis and experimental results show that both the space and time complexity of the algorithms are O(n).

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A probabilistic model to define the similarity based on conditional probabilities is proposed and a two-step method for estimating the similarity between two proteins based on protein interaction profile is proposed.
Abstract: High-throughput methods for detecting protein-protein interactions (PPI) have given researchers an initial global picture of protein interactions on a genomic scale. The huge data sets generated by such experiments pose new challenges in data analysis. Though clustering methods have been successfully applied in many areas of bioinformatics, many clustering algorithms cannot be readily applied to protein interaction data sets. One main problem is that the similarity between two proteins cannot be easily defined. This paper proposes a probabilistic model to define this similarity based on conditional probabilities. We then propose a two-step method for estimating the similarity between two proteins based on their protein interaction profiles. In the first step, the model is trained with proteins with known annotations. Based on this model, similarities are calculated in the second step. Experiments show that our method improves performance.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: The experimental results show that the probability for the correct tree to be among a very small number of trees constructed using the method is very high, which opens a new research direction to further investigate more efficient algorithms for phylogenetic inference.
Abstract: In this paper we introduce a new quartet-based method. This method makes use of the Bayes (or quartet) weights of quartets as used in quartet puzzling. However, all the weights from the related quartets are accumulated to form a global quartet weight matrix. This matrix provides integrated information and can lead us to recursively merge small sub-trees into larger ones until the final single tree is obtained. The experimental results show that the probability for the correct tree to be among a very small number of trees constructed using our method is very high. These significant results open a new research direction to further investigate more efficient algorithms for phylogenetic inference.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A parallel algorithm for the constrained multiple sequence alignment (CMSA) problem that seeks an optimal multiple alignment constrained to include a given pattern and is faster in general than the existing sequential dynamic programming solutions.
Abstract: We propose a parallel algorithm for the constrained multiple sequence alignment (CMSA) problem that seeks an optimal multiple alignment constrained to include a given pattern. We consider the dynamic programming computations in layers indexed by the symbols of the given pattern. In each layer we compute as a potential part of an optimal alignment for the CMSA problem, shortest paths for multiple sources and multiple destinations. These shortest paths problems are independent from one another (which enables parallel execution), and each can be solved using an A* algorithm specialized for the shortest paths problem for multiple sources and multiple destinations. The final step of our algorithm solves a single source single destination shortest path problem. Our experiments on real sequences show that our algorithm is faster in general than the existing sequential dynamic programming solutions.

Proceedings Article•DOI•
Ssu-Hua Huang1, Ru-Sheng Liu1, Chien-Yu Chen1, Ya-Ting Chao1, Shu-Yuan Chen1 •
19 Oct 2005
TL;DR: This paper proposes a method for discriminating outer membrane proteins from other proteins by support vector machines using combinations of gapped amino acid pair compositions that outperforms the OM classifier of PSORTb v.2.0 and a method based on dipeptide composition.
Abstract: Discriminating outer membrane proteins from proteins with other subcellular localizations and from other folding classes is important for further predicting their functions and structures. In this paper, we propose a method for discriminating outer membrane proteins from other proteins by support vector machines using combinations of gapped amino acid pair compositions. Using 5-fold cross-validation, the method achieves 95% precision and 92% recall on a dataset of proteins with well-annotated subcellular localizations, consisting of 471 outer membrane proteins and 1,120 other proteins. When applied to another dataset of 377 outer membrane proteins and 674 globular proteins belonging to four typical structural classes, the method reaches 96% precision and recall and correctly excludes 98% of the globular proteins. Our method outperforms the OM classifier of PSORTb v.2.0 and a method based on dipeptide composition.
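The feature construction can be sketched directly: for each gap g, count ordered amino acid pairs separated by g residues and normalize. Concatenating such vectors over several gaps would form the SVM input; the exact gaps and normalization used in the paper are not specified here, so this is an assumption-laden sketch:

```python
def gapped_pair_composition(seq, gap):
    """Frequency of ordered residue pairs (seq[i], seq[i + gap + 1]);
    gap=0 reduces to the usual dipeptide composition."""
    counts = {}
    n = len(seq) - gap - 1          # number of pairs at this gap
    for i in range(n):
        pair = seq[i] + seq[i + gap + 1]
        counts[pair] = counts.get(pair, 0) + 1
    return {p: c / n for p, c in counts.items()}
```

A full feature vector would fix an ordering over the 400 possible pairs and fill in zeros for absent ones before feeding the concatenated gap vectors to the SVM.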

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel model-free and stable gene selection method is proposed in this paper, which does not assume any statistical model on the gene expression data and it is not affected by the unbalanced samples.
Abstract: Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Detecting the most significantly differentially expressed genes under different conditions, or gene selection, has been a central focus for researchers. The gene selection problem becomes more difficult when the numbers of samples under different conditions vary significantly, or are unbalanced. A novel model-free and stable gene selection method is proposed in this paper, i.e., the method does not assume any statistical model on the gene expression data and it is not affected by the unbalanced samples. The method has been evaluated on two publicly available datasets, the leukemia dataset and the small round blue cell tumor dataset, where the experimental results showed that the proposed method is efficient and robust in identifying differentially expressed genes.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A Distance-based Sequence Indexing Method (DSIM) for indexing and searching genome databases is proposed, borrowing the idea of video compression, which achieves significantly faster response time than BLAST, while maintaining comparable accuracy.
Abstract: In this paper, we propose a Distance-based Sequence Indexing Method (DSIM) for indexing and searching genome databases. Borrowing an idea from video compression, we compress the genomic sequence database around a set of automatically selected reference words, formed from high-frequency data substrings and substrings from past queries. The compression captures the distance of each non-reference word in the database to some reference word. At runtime, a query is processed by comparing its substrings with the compressed data strings, through their distances to the reference words. We also propose an efficient scheme to incrementally update the reference words and the compressed data sequences as more data sequences are added and new queries come along. Extensive experiments on a human genome database with 2.62 GB of DNA sequence letters show that the new algorithm achieves significantly faster response time than BLAST, while maintaining comparable accuracy.
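The distance-based filtering idea can be illustrated with a triangle-inequality pruning step: if each data word is stored only as its distance to its nearest reference word, a word can be within distance t of the query only if its stored distance is within t of the query's distance to the same reference word. This is a sketch in the spirit of DSIM, not its actual index layout:

```python
def hamming(a, b):
    # Hamming distance over equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def dsim_candidates(query, words, refs, t):
    """Keep only words that could lie within distance t of the query,
    using the triangle inequality through each word's nearest reference:
    |d(query, ref) - d(word, ref)| <= d(query, word)."""
    index = [(w, min(refs, key=lambda r: hamming(w, r))) for w in words]
    return [w for w, r in index
            if abs(hamming(query, r) - hamming(w, r)) <= t]
```

Only the surviving candidates would then be compared against the query directly, which is where the speedup over exhaustive scanning comes from.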

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This investigation identifies phylogenetic motifs using heuristic maximum parsimony trees and shows that when using parsimony the functional site prediction accuracy of PMs improves substantially, particularly on divergent datasets.
Abstract: We have recently demonstrated (La et al., Proteins, 58, 2005) that sequence fragments approximating the overall familial phylogeny, called phylogenetic motifs (PMs), represent a promising protein functional site prediction strategy. Previous results across a structurally and functionally diverse dataset indicate that phylogenetic motifs correspond to a wide variety of known functional characteristics. Phylogenetic motifs are detected using a sliding window algorithm that compares neighbor joining trees on the complete alignment to those on the sequence fragments. In this investigation we identify PMs using heuristic maximum parsimony trees. We show that when using parsimony the functional site prediction accuracy of PMs improves substantially, particularly on divergent datasets. We also show that the new PMs found using parsimony are not necessarily conserved in sequence and, therefore, would not be detected by traditional motif (information content-based) approaches.

Proceedings Article•DOI•
Chin-Tang Hsieh1, Guang-Lin Hsieh1, E. Lai1, Zong-Ting Hsieh1, Guo-Ming Hong1 •
19 Oct 2005
TL;DR: The MSP is used to implement an equiripple-design finite impulse response (FIR) filter, integrating a ring buffer for the input samples with the symmetric structure of the FIR filter to compute the convolution efficiently.
Abstract: A low-power, portable, and easily implemented Holter recorder is necessary for patients and for researchers of electrocardiography (ECG). Such a Holter recorder, built from off-the-shelf components, is realized with a mixed signal processor (MSP) in this paper. To decrease the complexity of the analog circuits and the interference of 60 Hz noise from the power line, we use the MSP to implement an equiripple-design finite impulse response (FIR) filter. We also integrate a ring buffer for the input samples with the symmetric structure of the FIR filter to compute the convolution efficiently. The experimental results show that the PQRST features of the output ECG signal are easy to distinguish. This ECG signal is recorded for 24 hr using an SD card. Furthermore, the ECG signal is transmitted with a smartphone via Bluetooth to decrease the burden on the Holter recorder.
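The symmetry trick can be sketched in a few lines: a linear-phase (equiripple) FIR filter has taps with h[k] = h[N-1-k], so samples equidistant from both ends of the delay line can be added before a single multiply, roughly halving the multiplications. A ring buffer holds the last N samples. This is an illustrative sketch, not the MSP implementation:

```python
class SymmetricFIR:
    """Streaming FIR filter exploiting tap symmetry h[k] == h[N-1-k]."""
    def __init__(self, taps):
        assert taps == taps[::-1], "linear-phase FIR taps are symmetric"
        self.taps = taps
        self.buf = [0.0] * len(taps)   # ring buffer of recent samples
        self.pos = 0                    # next write position

    def step(self, x):
        n = len(self.taps)
        self.buf[self.pos] = x
        self.pos = (self.pos + 1) % n
        # Reorder the ring buffer newest-to-oldest for clarity.
        ordered = [self.buf[(self.pos - 1 - k) % n] for k in range(n)]
        acc = 0.0
        for k in range(n // 2):
            # One multiply covers two symmetric taps.
            acc += self.taps[k] * (ordered[k] + ordered[n - 1 - k])
        if n % 2:
            acc += self.taps[n // 2] * ordered[n // 2]
        return acc
```

Feeding a unit impulse through the filter reproduces the tap values, the usual sanity check for an FIR implementation.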

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into the process of seed generation.
Abstract: We propose a method for finding seeds for the local alignment of two nucleotide sequences. Our method uses randomized algorithms to find approximate seeds. We present a dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into the process of seed generation. Experimental results show that our method produces better quality seeds with improved running time and memory usage compared to traditional non-spaced and spaced seeds. The presented algorithm scales well to longer seeds while maintaining quality and performance.
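The basic machinery behind fingerprint-indexed seeding can be sketched with a Rabin-Karp style rolling hash: fingerprint every k-gram of one sequence into an index, then probe it with the k-grams of the other. The sketch below is a simplification under stated assumptions; the paper's dynamic index, randomized HSA algorithm, and approximate (non-exact) seeds are more involved, and the `sample_rate` thinning here is only a loose nod to its randomization.

```python
import random

def kgram_fingerprints(seq, k, base=4, mod=(1 << 31) - 1):
    """Rolling (Rabin-Karp style) fingerprints for every k-gram of seq."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    high = pow(base, k - 1, mod)
    fps, h = [], 0
    for i, ch in enumerate(seq):
        if i >= k:
            h = (h - code[seq[i - k]] * high) % mod  # drop the oldest symbol
        h = (h * base + code[ch]) % mod
        if i >= k - 1:
            fps.append(h)
    return fps  # fps[j] is the fingerprint of seq[j:j+k]

def find_seeds(s1, s2, k, sample_rate=1.0, rng=None):
    """Exact k-gram seed matches between s1 and s2 via a fingerprint index.

    sample_rate < 1.0 randomly thins the index, trading sensitivity for
    memory; an illustrative stand-in for randomized seed generation.
    """
    rng = rng or random.Random(0)
    index = {}
    for j, fp in enumerate(kgram_fingerprints(s1, k)):
        if rng.random() <= sample_rate:
            index.setdefault(fp, []).append(j)
    seeds = []
    for j2, fp in enumerate(kgram_fingerprints(s2, k)):
        for j1 in index.get(fp, ()):
            if s1[j1:j1 + k] == s2[j2:j2 + k]:  # guard against hash collisions
                seeds.append((j1, j2))
    return seeds
```

Each seed is a pair of start positions sharing an identical k-gram, which an aligner would then extend into a local alignment.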

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A new RNA pseudoknot prediction method based on term rewriting rather than on dynamic programming, comparative sequence analysis, or context-free grammars is presented, indicating that term rewriting has a broad potential in RNA applications.
Abstract: RNA plays a critical role in mediating every step of cellular information transfer from genes to functional proteins. Pseudoknots are widely occurring structural motifs found in all types of RNA and are also functionally important. Therefore, predicting their structures is an important problem. In this paper, we present a new RNA pseudoknot prediction method based on term rewriting rather than on dynamic programming, comparative sequence analysis, or context-free grammars. The method we describe is implemented using the Mfold RNA/DNA folding package and the term rewriting language Maude. Our method was tested on 211 pseudoknots in PseudoBase and achieves an average accuracy of 74.085% compared to the experimentally determined structure. In fact, most pseudoknots discovered by our method achieve an accuracy of above 90%. These results indicate that term rewriting has a broad potential in RNA applications, from prediction of pseudoknots to higher level RNA structures involving complex RNA tertiary interactions.
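To make the term-rewriting paradigm concrete, the toy engine below repeatedly applies rules (lhs → rhs) until a normal form is reached. It is a string-rewriting stand-in for illustration only: the actual work rewrites structured RNA terms in Maude, with rules far richer than these. In the example, a rule such as "GLC" → "L" absorbs a G–C pair that closes a loop token `L`, so a string that reduces all the way to `L` represents a well-formed stem-loop.

```python
def rewrite(term, rules, max_steps=1000):
    """Apply string-rewriting rules (lhs, rhs) until no rule matches.

    A toy stand-in for Maude-style term rewriting; the paper's rules
    operate on RNA structure terms, not plain strings.
    """
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in term:
                term = term.replace(lhs, rhs, 1)  # rewrite leftmost match
                break
        else:
            return term  # normal form: no rule applies
    raise RuntimeError("no normal form within step limit")

# Illustrative rules: a Watson-Crick pair flanking a loop is absorbed.
STEM_RULES = [("GLC", "L"), ("CLG", "L"), ("ALU", "L"), ("ULA", "L")]
```

Here "GGLCC" (two G–C pairs around a loop) reduces to "L", while a string with a mismatched flank gets stuck short of "L".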

Proceedings Article•DOI•
19 Oct 2005
TL;DR: A novel normalization approach is presented that couples concurrent identification of invariantly expressed genes (IEGs) with nonlinear regression normalization, achieving low expression variance across replicates and excellent fold-change preservation.
Abstract: Normalization is an important prerequisite for almost all follow-up microarray data analysis steps. Accurate normalization assures a common base for comparative biomedical studies using gene expression profiles across different experiments and phenotypes. In this paper, we present a novel normalization approach, the iterative nonlinear regression (INR) method, that couples concurrent identification of invariantly expressed genes (IEGs) with nonlinear regression normalization. We demonstrate the principle and performance of the INR approach on two real microarray data sets. Compared to major peer methods (e.g., the linear regression, Loess, and iterative ranking methods), the INR method shows superior performance in achieving low expression variance across replicates and excellent fold-change preservation.

Proceedings Article•DOI•
19 Oct 2005
TL;DR: This paper presents a method which uses contrast analysis on the frequency of sequences to identify delimiters for optional fields and help complete the layout descriptions of biological datasets.
Abstract: One of the major problems in biological data integration is that many data sources are stored as atlasses, with a variety of different layouts. Integrating data from such sources can be an extremely time-consuming task. We have been developing data mining techniques to help learn the layout of a dataset in a semi-automatic way. In this paper, we focus on the problem of identifying delimiters for optional fields. Since these fields do not occur in every record, frequency-based methods are not able to identify the corresponding delimiters. We present a method which uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three biological datasets.
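The intuition behind contrast analysis on delimiter frequencies can be sketched simply: a mandatory field's delimiter appears in (nearly) every record, while an optional field's delimiter appears in only a stable minority, so the contrast in per-record occurrence rates separates the two. The sketch below is an illustrative simplification: the candidate-generation rule (tokens ending in a colon) and the frequency thresholds are assumptions, and the paper's actual contrast analysis over character sequences is more general.

```python
from collections import Counter

def optional_delimiters(records, min_frac=0.1, max_frac=0.9):
    """Flag candidate delimiters for optional fields (illustrative).

    Strings occurring in almost all records are treated as mandatory
    delimiters; strings occurring in a minority of records, but often
    enough not to be noise, are flagged as optional-field delimiters.
    """
    counts = Counter()
    for rec in records:
        # candidate delimiters here: whitespace tokens ending in ':'
        counts.update({tok for tok in rec.split() if tok.endswith(":")})
    n = len(records)
    return sorted(tok for tok, c in counts.items()
                  if min_frac <= c / n <= max_frac)
```

In a toy dataset where every record has `ID:` and `SEQ:` but only some have `NOTE:`, pure frequency ranking would treat `NOTE:` as noise, while the occurrence-rate contrast flags it as an optional-field delimiter.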

Proceedings Article•DOI•
19 Oct 2005
TL;DR: By using a wrapper induction system for the creation and maintenance of wrappers, the scalability, flexibility, and stability of the integrated information system are easily maintained.
Abstract: Integrating life science Web databases, while important and necessary, is a challenge for current integration systems, mainly due to the large number of these databases, their heterogeneity, and the fact that their interfaces may change often. BACIIS, a biological and chemical information integration system, is a tightly coupled federated database system that uses the mediator-wrapper method to retrieve information from several remote Web databases. BACIIS relies on a semi-automated approach for generating and maintaining wrappers in order to provide a scalable system with limited maintenance overhead. The semi-automatic wrapper induction in BACIIS is efficient because it is based on, but not limited to, domain knowledge. Tests show that the use of an ontology increases the accuracy of the wrapper induction. We also present how the wrapper induction system facilitates wrapper updates and assists in information extraction. By using a wrapper induction system for the creation and maintenance of wrappers, the scalability, flexibility, and stability of the integrated information system are easily maintained.
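At its simplest, a wrapper of the kind such systems induce is a pair of landmarks around each field of interest in a result page. The sketch below builds such a landmark wrapper by hand to show what the induced artifact looks like; BACIIS learns the landmarks semi-automatically with ontology guidance, and the page fragment, landmark strings, and function names here are all hypothetical.

```python
import re

def make_wrapper(prefix, suffix):
    """Build a simple landmark-based wrapper: extract the text between a
    prefix landmark and a suffix landmark, for every occurrence on a page.
    Wrapper induction systems learn such landmarks from example pages."""
    pat = re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.S)
    return lambda page: [m.strip() for m in pat.findall(page)]

# hypothetical fragment of a Web database result page
page = "<b>Gene:</b> BRCA1<br><b>Gene:</b> TP53<br>"
extract_gene = make_wrapper("<b>Gene:</b>", "<br>")
```

When the remote site changes its layout, only the landmark strings need relearning, which is what makes semi-automatic induction attractive for maintenance.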