
Showing papers presented at "International Conference on Bioinformatics in 2006"


Journal Article•DOI•
18 Dec 2006
TL;DR: No individual method had a sufficient level of sensitivity across both evaluation sets to enable reliable application to hypothetical proteins; all methods showed lower performance on the LOCATE dataset, with variable performance on individual subcellular localizations.
Abstract: Background: Determination of the subcellular location of a protein is essential to understanding its biochemical function. This information can provide insight into the function of hypothetical or novel proteins. These data are difficult to obtain experimentally but have become especially important since many whole genome sequencing projects have been completed and many resulting protein sequences still lack detailed functional information. To address this paucity of data, many computational prediction methods have been developed. However, these methods have varying levels of accuracy and perform differently based on the sequences that are presented to the underlying algorithm. It is therefore useful to compare these methods and monitor their performance. Results: In order to perform a comprehensive survey of prediction methods, we selected only methods that accepted large batches of protein sequences, were publicly available, and were able to predict localization to at least nine of the major subcellular locations (nucleus, cytosol, mitochondrion, extracellular region, plasma membrane, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, and lysosome). The selected methods were CELLO, MultiLoc, Proteome Analyst, pTarget and WoLF PSORT. These methods were evaluated using 3763 mouse proteins from SwissProt that represent the source of the training sets used in development of the individual methods. In addition, an independent evaluation set of 2145 mouse proteins from LOCATE, biased towards subcellular localizations underrepresented in SwissProt, was used. The sensitivity and specificity were calculated for each method and compared to a theoretical value based on what might be observed by random chance. Conclusion: No individual method had a sufficient level of sensitivity across both evaluation sets to enable reliable application to hypothetical proteins. All methods showed lower performance on the LOCATE dataset, and variable performance on individual subcellular localizations was observed. Proteins localized to the secretory pathway were the most difficult to predict, while nuclear and extracellular proteins were predicted with the highest sensitivity.
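The per-localization evaluation described above reduces to standard confusion-matrix arithmetic. A minimal sketch, assuming hypothetical label lists in place of the SwissProt/LOCATE annotations:

    from collections import Counter

    LOCATIONS = ["nucleus", "cytosol", "mitochondrion", "extracellular region",
                 "plasma membrane", "Golgi apparatus", "ER", "peroxisome",
                 "lysosome"]

    def per_class_metrics(true_locs, pred_locs):
        """Return {location: (sensitivity, specificity)} for one predictor."""
        n = len(true_locs)
        metrics = {}
        for loc in LOCATIONS:
            tp = sum(t == loc and p == loc for t, p in zip(true_locs, pred_locs))
            fn = sum(t == loc and p != loc for t, p in zip(true_locs, pred_locs))
            tn = sum(t != loc and p != loc for t, p in zip(true_locs, pred_locs))
            fp = n - tp - fn - tn
            sens = tp / (tp + fn) if tp + fn else float("nan")
            spec = tn / (tn + fp) if tn + fp else float("nan")
            metrics[loc] = (sens, spec)
        return metrics

    def random_chance_sensitivity(pred_locs):
        """Expected sensitivity under random chance: the fraction of
        predictions a method allocates to each class."""
        freq = Counter(pred_locs)
        return {loc: freq[loc] / len(pred_locs) for loc in LOCATIONS}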

67 citations


Proceedings Article•DOI•
10 Nov 2006
TL;DR: This paper describes how natural language processing and text mining techniques were implemented on the transcribed verbal descriptions from retinal experts of biomedical disease features to generate feature-attribute pairs that were incorporated within a user interface for a collaborative ontology development tool.
Abstract: The use of text mining and natural language processing can extend into the realm of knowledge acquisition and management for biomedical applications. In this paper, we describe how we implemented natural language processing and text mining techniques on the transcribed verbal descriptions from retinal experts of biomedical disease features. The feature-attribute pairs generated were then incorporated within a user interface for a collaborative ontology development tool. This tool, IDOCS, is being used in the biomedical domain to help retinal specialists reach a consensus on a common ontology for describing age-related macular degeneration (AMD). We compare the use of traditional text mining and natural language processing techniques with that of a retinal specialist's analysis and discuss how we might integrate these techniques for future biomedical ontology and user interface development.

21 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A study on simulated data produced by a mathematical model of cell cycle control in budding yeast confirmed the robustness of the linear model and its suitability for a first level, genome-wide analysis of high throughput dynamic data.
Abstract: Dynamic Bayesian networks offer a powerful modeling tool to unravel cellular mechanisms. In particular, Linear Gaussian Networks allow researchers to avoid information loss associated with discretization and render the learning process computationally tractable even for hundreds of variables. Yet, are linear models suitable to learn the complex dynamic interactions among genes and proteins? We here present a study on simulated data produced by a mathematical model of cell cycle control in budding yeast: the results obtained confirmed the robustness of the linear model and its suitability for a first level, genome-wide analysis of high throughput dynamic data.
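Because the model is linear Gaussian, each gene's conditional distribution given the previous time slice is an ordinary linear regression, which is what keeps learning tractable for hundreds of variables. A minimal sketch on synthetic data standing in for the budding-yeast simulations:

    import numpy as np

    rng = np.random.default_rng(0)
    T, G = 2000, 10                                # time points, genes (toy)
    A_true = rng.normal(0, 1, (G, G)) / (2 * np.sqrt(G))  # stable dynamics
    X = np.zeros((T, G))
    X[0] = rng.normal(size=G)
    for t in range(T - 1):
        X[t + 1] = X[t] @ A_true.T + rng.normal(0, 0.1, G)

    # Least-squares fit of X[t+1] ~ A X[t]; A_hat encodes the network.
    B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    A_hat = B.T
    print("max absolute error in recovered coefficients:",
          np.abs(A_hat - A_true).max())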

18 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A novel DSP model is proposed that clearly explains the intricate operation of the DNA spectrum, allows the derivation of new DNA spectrum expressions which, in turn, generalize and unify previous work and suggests an efficient way to improve the detection of protein coding regions by computing a filtered spectrum.
Abstract: Many signal processing techniques have been introduced in the past to identify the protein coding regions by detecting the so-called period-3 component in the DNA spectrum. However, a solid understanding of this observed phenomenon and its underlying mechanism from a DSP perspective has been missing from the literature. We therefore propose a novel DSP model that i) clearly explains the intricate operation of the DNA spectrum, ii) allows the derivation of new DNA spectrum expressions which, in turn, generalize and unify previous work and iii) suggests an efficient way to improve the detection of protein coding regions by computing a filtered spectrum.
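The period-3 component referred to above is conventionally measured by mapping the sequence to four binary indicator signals and summing their DFT power at frequency k = N/3. A minimal sketch of that standard computation (not of the authors' new model):

    import numpy as np

    def period3_power(seq):
        """Ratio of DFT power at k = N/3 to the average power (choosing N a
        multiple of 3 keeps the period-3 frequency on an exact DFT bin)."""
        N = len(seq)
        total = np.zeros(N)
        for base in "ACGT":
            u = np.array([1.0 if c == base else 0.0 for c in seq])
            total += np.abs(np.fft.fft(u)) ** 2
        return total[N // 3] / total[1:N // 2].mean()

    rng = np.random.default_rng(0)
    coding_like = "ATGGCC" * 50                        # strong codon bias
    random_seq = "".join(rng.choice(list("ACGT"), 300))
    print(period3_power(coding_like), period3_power(random_seq))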

17 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: Simulation results using synthetic data and microarray measurements show the effectiveness of the proposed scheme, where genetic programming is applied to identify the structure of the model and Kalman filtering is employed to estimate the parameters in each iteration.
Abstract: In this paper, gene regulatory networks are inferred through evolutionary modeling and time-series microarray measurements. A nonlinear differential equation model is adopted and an iterative algorithm is proposed to identify the model, where genetic programming is applied to identify the structure of the model and Kalman filtering is employed to estimate the parameters in each iteration. Simulation results using synthetic data and microarray measurements show the effectiveness of the proposed scheme.
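A highly simplified sketch of the identify-then-estimate iteration: a handful of fixed candidate structures stands in for the genetic-programming search, and ordinary least squares stands in for the Kalman filter. All data and structure names are illustrative, not from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 200)
    x = np.exp(-0.5 * t) + 0.002 * rng.normal(size=t.size)  # toy expression
    dxdt = np.gradient(x, t)

    # Candidate structures dx/dt = a*f(x); GP would evolve such expressions.
    candidates = {"linear": x, "quadratic": x**2, "sqrt": np.sqrt(np.abs(x))}

    best = None
    for name, f in candidates.items():
        a = (f @ dxdt) / (f @ f)              # least-squares parameter
        err = np.mean((dxdt - a * f) ** 2)    # fit of this structure
        if best is None or err < best[2]:
            best = (name, a, err)
    print("selected structure:", best[0], "with parameter", round(best[1], 3))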

16 citations


Proceedings Article•DOI•
10 Nov 2006
TL;DR: Correlate summation analysis (CSA) provides an alternative method to yield important variables and data clustering which appear to be comparable to PCA and may indicate an underlying mechanism affecting two variables.
Abstract: Principal Component Analysis (PCA) can identify key variables among complex data sets, but requires special statistical packages. Alternatively, Discrete and Aggregate Correlate Summation (DCΣ/ACΣ) identifies the most covariant variables relative to mean clustering for grouped and individual data, respectively. We compared these analyses regarding the influence of estrogen and salt on the hypertensive phenotype of the female mRen2.Lewis strain. DCΣx compares changes in correlation between two groups for each variable versus all of the others, relative to mean shift. ACΣx determines which variable has the highest total correlation to the other variables, relative to its mean-reduced (normalized) standard deviation (nSD). To compare correlate summation to PCA, the absolute weights of the first principal component (EVEC1x) were multiplied by the nSD as in ACΣx. Nine variables, including proteinuria, serum ACE, plasma Ang II, renin, heart weight to body weight ratio, and systolic blood pressure, were analyzed with respect to normal and high salt diets, as well as to estrogen intact and depleted conditions. DCΣx results for both arms of the study were significantly correlated with EVEC1x (r=0.72, p
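A sketch of one plausible reading of the ACΣ computation: each variable's summed absolute correlation with the others, weighted by its normalized standard deviation (nSD). The data are random placeholders, not the mRen2.Lewis measurements, and the exact weighting in the paper may differ:

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.normal(10, 2, size=(40, 9))         # 40 animals x 9 variables

    R = np.corrcoef(data, rowvar=False)
    total_corr = np.abs(R).sum(axis=0) - 1         # drop self-correlation
    nsd = data.std(axis=0) / data.mean(axis=0)     # normalized SD
    acs = total_corr * nsd                         # higher = more covariant
    print("most covariant variable index:", int(acs.argmax()))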

14 citations


Proceedings Article•DOI•
Slobodan Vucetic•
10 Nov 2006
TL;DR: An algorithm is proposed that omits stemming and, instead, uses the most discriminative substrings as attributes in classification, which is particularly useful when labeled datasets are small.
Abstract: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Due to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we proposed [1] an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes. The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (with Area Under the ROC Curve (AUC) accuracy in the range 0.92-0.97) when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
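The substring-attribute idea can be approximated with character n-grams plus a discriminative filter. A sketch using scikit-learn, where chi-squared selection is an illustrative stand-in for the paper's exact criterion and the documents and labels are toys, not the iProLINK abstracts:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["protein was phosphorylated at serine residues",
            "acetylation of lysine was observed",
            "no modification detected in the assay",
            "phosphorylation site mapped by mass spectrometry"]
    labels = [1, 0, 0, 1]    # 1 = phosphorylation evidence (toy labels)

    clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(4, 6)),  # substrings
        SelectKBest(chi2, k=20),               # keep most discriminative
        MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["phosphorylated residue detected"]))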

12 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: Experimental results of Terahertz spectroscopy of several different DNA samples show that the EMD aids the clustering process and yields clustering of higher validity than that obtained from the raw data.
Abstract: DNA sequence analysis has been widely studied by gene-expression microarray techniques. Few results, however, have been provided by Terahertz spectroscopy, which reveals the absorption or reflectance percentage from different DNA sequences. Previous Terahertz methods have lacked a quantitative analysis of the spectroscopy features, and no definitive conclusion regarding the data can be easily drawn. In this paper, we use a signal processing approach which gives a quantitative interpretation of the DNA spectroscopy. Due to the presence of physical noise, the data can be contaminated by both random fluctuations and impulsive noise. A new signal processing tool called empirical mode decomposition (EMD) is employed to remove the noise and extract the trend of the signal. The data is subsequently partitioned by clustering methods. Experimental results of Terahertz spectroscopy of several different DNA samples show that the EMD aids the clustering process and yields clustering of higher validity than that obtained from the raw data.
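A sketch of the denoise-then-cluster pipeline, assuming the third-party PyEMD package (pip install EMD-signal): the fastest intrinsic mode functions are treated as noise and discarded before clustering. The spectra here are synthetic stand-ins for the Terahertz measurements:

    import numpy as np
    from PyEMD import EMD                         # pip install EMD-signal
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 500)
    spectra = [np.sin(3 * (g + 1) * np.pi * x) + 0.3 * rng.normal(size=x.size)
               for g in range(2) for _ in range(5)]  # two underlying groups

    trends = []
    for s in spectra:
        imfs = EMD()(s)                           # IMFs, slowest (trend) last
        keep = imfs[2:] if len(imfs) > 2 else imfs[-1:]
        trends.append(keep.sum(axis=0))           # drop the two fastest IMFs

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trends)
    print(labels)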

11 citations


Proceedings Article•
01 Mar 2006

11 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: The results indicate that the proposed method is effective in identifying DNA copy number changes from the microarray comparative genomic hybridization (aCGH) profile.
Abstract: Cancer development is usually associated with DNA copy number changes in the genome. DNA copy number changes correspond to chromosomal aberrations and signify abnormality of a cell. Therefore, identifying statistically significant DNA copy number changes is evidently crucial in cancer research, clinical diagnostic applications, and other related genomic research. The problem can be formulated using statistical change point theory. We propose to use the mean and variance change point model to study the DNA copy number changes from the microarray comparative genomic hybridization (aCGH) profile. The approximate p-value of identifying a change point is derived using the Schwarz information criterion (SIC). The proposed method has been validated by Monte Carlo simulation and applications to aCGH profiles from several cell lines (fibroblast cancer cell line, breast tumor cell line, and breast cancer cell line). The results indicate that the proposed method is effective in identifying DNA copy number changes.
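A sketch of the mean-and-variance change point test on simulated log-ratios, using a common form of the SIC (the paper's exact penalty terms may differ): the criterion under "no change" is compared with the minimum over candidate change positions k.

    import numpy as np

    def sic_no_change(x):
        n = len(x)
        return n * np.log(2 * np.pi * x.var()) + n + 2 * np.log(n)

    def sic_change_at(x, k):
        n = len(x)
        v1, v2 = x[:k].var(), x[k:].var()
        return (k * np.log(2 * np.pi * v1) + (n - k) * np.log(2 * np.pi * v2)
                + n + 4 * np.log(n))

    rng = np.random.default_rng(4)
    x = np.concatenate([rng.normal(0.0, 0.1, 60),    # normal copy number
                        rng.normal(0.6, 0.25, 40)])  # gain, higher noise
    ks = range(5, len(x) - 5)
    sics = [sic_change_at(x, k) for k in ks]
    k_best = list(ks)[int(np.argmin(sics))]
    if min(sics) < sic_no_change(x):
        print("change point detected near probe", k_best)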

10 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A novel approach that combines fuzzy clustering with multiscale feature selection to improve the accuracy of classifying M-FISH images is introduced and will improve the reliability of the M-FISH imaging technique in identifying subtle and cryptic genetic aberrations for cancer diagnosis and genetic research.
Abstract: Multi-color or multiplex fluorescence in situ hybridization (M-FISH) imaging is a recently developed molecular cytogenetic diagnosis technique for rapid visualization of genomic aberrations at the chromosomal level. The reliability of the technique depends primarily on accurate pixel-wise classification. In this paper we introduce a novel approach that combines fuzzy clustering with multiscale feature selection to improve the accuracy of classifying M-FISH images. A multiscale principal component analysis (MPCA) is proposed to reduce the redundancy between multi-channel images. In comparison with conventional PCA, it offers adaptive redundancy reduction. The algorithms have been tested on an M-FISH image database, demonstrating improved classification accuracy. The increased accuracy of pixel-wise classification will improve the reliability of the M-FISH imaging technique in identifying subtle and cryptic genetic aberrations for cancer diagnosis and genetic research.
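The clustering half of the approach can be illustrated with a compact fuzzy c-means routine (the multiscale PCA step is omitted; the pixels below are random multi-channel vectors standing in for M-FISH spectra):

    import numpy as np

    def fuzzy_cmeans(X, c, m=2.0, iters=100):
        """X: (n_pixels, n_channels). Returns membership matrix U (n, c)."""
        rng = np.random.default_rng(0)
        centers = X[rng.choice(len(X), c, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
            U = 1.0 / (d ** (2 / (m - 1)))
            U /= U.sum(axis=1, keepdims=True)      # normalize memberships
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
        return U

    X = np.random.default_rng(5).normal(size=(1000, 5))  # toy 5-channel pixels
    U = fuzzy_cmeans(X, c=4)
    labels = U.argmax(axis=1)                      # hard pixel classes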

Proceedings Article•DOI•
10 Nov 2006
TL;DR: This research explores thresholding of SVM scores, the relationship of performance to hierarchy level and to the number of positives in the training sets, and finds that hierarchy level is important especially for the molecular function and biological process hierarchies.
Abstract: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach for doing this is to use the information available about these genes in the biomedical literature. Our goal, based on this approach, is to develop automatic methods for annotation that could supplement the expensive manual annotation processes currently in place. Using a set of Support Vector Machine (SVM) classifiers we were able to achieve F-scores of 0.48, 0.4 and 0.32 for codes of the molecular function, cellular component and biological process GO hierarchies, respectively. We explore thresholding of SVM scores, the relationship of performance to hierarchy level and to the number of positives in the training sets. We find that hierarchy level is important, especially for the molecular function and biological process hierarchies. We find that the cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research also exploits the hierarchical structures by defining and testing a relaxed criterion for classification correctness.
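A sketch of the score-thresholding step on synthetic data: rather than the default SVM cutoff of zero, the decision threshold is tuned on held-out examples to maximize the F-score for a given GO code. The classifier choice and data here are assumptions for illustration:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)
    X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, random_state=0)

    svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
    scores = svm.decision_function(X_dev)
    best_t = max(np.unique(scores),
                 key=lambda t: f1_score(y_dev, scores >= t))
    print("tuned threshold:", round(float(best_t), 3),
          "F-score:", round(f1_score(y_dev, scores >= best_t), 3))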

Proceedings Article•DOI•
28 May 2006
TL;DR: A modeling approach based on Probabilistic Boolean Networks for the inference of genetic regulatory networks from gene expression time-course data in different biological conditions, i.e., making use of the information contained in sets of genes and the interactions between genes rather than single-gene analyses.
Abstract: We propose a modeling approach based on Probabilistic Boolean Networks for the inference of genetic regulatory networks from gene expression time-course data in different biological conditions, i.e., making use of the information contained in sets of genes and the interactions between genes rather than single-gene analyses. This model is a collection of traditional Probabilistic Boolean Networks. We also present an approach based on constrained prediction and the Coefficient of Determination (COD) for the identification of the model from gene expression data. The modeling approach is applied in the context of pathway biology to the analysis of gene interaction networks.

Proceedings Article•DOI•
10 Nov 2006
TL;DR: A way to incorporate a priori knowledge of gene relationships into LSI/SVD and NMF and a gene retrieval method based on NMF (GR/NMF), which shows comparable performance with latent semantic indexing based on SVD.
Abstract: The construction of literature-based networks of gene-gene interactions is one of the most important applications of text mining in bioinformatics. Extracting potential gene relationships from the biomedical literature may be helpful in building biological hypotheses that can be explored further experimentally. In this paper, we explore the utility of singular value decomposition (SVD) and non-negative matrix factorization (NMF) to extract unrecognized gene relationships from the biomedical literature by taking advantage of known gene relationships. We introduce a way to incorporate a priori knowledge of gene relationships into LSI/SVD and NMF. In addition, we propose a gene retrieval method based on NMF (GR/NMF), which shows comparable performance with latent semantic indexing based on SVD.
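A sketch of NMF-based gene retrieval on a random gene-by-document matrix (the real matrix would be built from the literature, and the paper's incorporation of a priori knowledge is omitted here): genes are ranked by cosine similarity of their factor representations.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(6)
    A = rng.random((50, 200))              # 50 genes x 200 documents (toy)

    W = NMF(n_components=10, init="nndsvda", max_iter=500,
            random_state=0).fit_transform(A)   # one factor row per gene

    query_gene = 0
    sims = cosine_similarity(W[query_gene:query_gene + 1], W).ravel()
    related = np.argsort(-sims)[1:6]           # top 5 related genes
    print("genes most related to gene 0:", related)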

Proceedings Article•DOI•
10 Nov 2006
TL;DR: It is found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on such symbols than the other two systems, possibly due to fine-tuning of the system for those symbols.
Abstract: With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from to build literature mining applications. In this study, we focus on one specific kind of task, i.e., detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of documents recently published in the biomedical domain, ii) what the coverage is for various knowledge bases in including acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems, namely, ALICE (a handcrafted pattern/rule based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), in detecting definitions for acronyms/abbreviations/symbols, as well as the conceptual coverage of existing thesauri, namely, the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly due to the fact that most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precisions and recalls of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols, where ALICE performs relatively better than the other two, possibly due to fine-tuning of the system for those symbols. We also found existing knowledge bases have a good coverage of definitions for those frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
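The Schwartz and Hearst program mentioned above rests on a short right-to-left alignment between an abbreviation and its candidate definition. A minimal re-implementation of that core matching step (not of the full surrounding pipeline):

    def find_long_form(short, candidate):
        """Return the matched definition inside `candidate`, or None."""
        s, l = len(short) - 1, len(candidate) - 1
        while s >= 0:
            c = short[s].lower()
            if not c.isalnum():
                s -= 1
                continue
            # Scan left for a matching character; the short form's first
            # character must additionally sit at the start of a word.
            while l >= 0 and (candidate[l].lower() != c or
                              (s == 0 and l > 0 and candidate[l - 1].isalnum())):
                l -= 1
            if l < 0:
                return None
            s -= 1
            l -= 1
        return candidate[l + 1:]

    print(find_long_form("AMD", "age-related macular degeneration"))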

Proceedings Article•
01 Jan 2006
TL;DR: It is common that certain incorrect trees can have likelihood values at least as large as that of the correct tree, suggesting that even if the authors are able to find a truly globally optimal tree under the maximum likelihood criterion, this tree may not necessarily be the correct phylogenetic tree.
Abstract: Recently we developed a new quartet-based algorithm for phylogenetic analysis [22]. This algorithm constructs a limited number of trees for a given set of DNA or protein sequences, and the initial experimental results show that the probability for the correct tree to be included in this small set of trees is very high. In this paper we further extend the idea. We first discuss a revision to the original algorithm to reduce the number of trees generated, while keeping the high probability for the correct tree to be included. We then deal with the issue of how to retrieve the correct tree from the generated trees; our current approach is to calculate the likelihood values of these trees and pick a few of the best ones, which have the highest likelihood values. Though the experimental results are comparable to those obtained from currently popular ML-based algorithms, we find that it is common that certain incorrect trees can have likelihood values at least as large as that of the correct tree. A significant implication of this is that even if we are able to find a truly globally optimal tree under the maximum likelihood criterion, this tree may not necessarily be the correct phylogenetic tree!

Proceedings Article•DOI•
10 Nov 2006
TL;DR: These efforts towards defining Bmp2 gene expression determinants are described by combining functional assays with computational analyses of emerging genome data and suggest that both primary sequence and more subtle parameters such as nucleotide composition control BMP2 expression at the post-transcriptional level.
Abstract: Comparing genomes from diverse species has revealed surprisingly high conservation between proteins of vastly different organisms, e.g. 75% of pufferfish proteins have human counterparts. Thus subtle variation in the expression of master developmental control genes, like Bone Morphogenetic Protein (Bmp)2, is central to species differentiation. Understanding the evolution of the complex transcriptional and post-transcriptional mechanisms required to precisely regulate such genes requires novel, interdisciplinary approaches. We describe here our efforts towards defining Bmp2 gene expression determinants by combining functional assays with computational analyses of emerging genome data. Our results suggest that both primary sequence and more subtle parameters such as nucleotide composition control Bmp2 expression at the post-transcriptional level.

Journal Article•DOI•
01 Dec 2006
TL;DR: Concerns with the validation steps are showcased that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods in investigations involving high-dimensional predictors with complex between-feature dependencies.
Abstract: In a recent article in PLoS Genetics, Bock et al. (2006) undertake an extensive computational epigenetics analysis of the ability of DNA sequence-derived features, capturing attributes such as tetramer frequencies, repeats and predicted structure, to predict the methylation status of CpG islands. Their suite of analyses appears highly rigorous with regard to accompanying validation procedures, employing stringent Bonferroni corrections, stratified cross-validation, and follow-up experimental verification. Here, however, we showcase concerns with the validation steps, in part ascribable to the genome scale of the investigation, that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods. A series of new analyses of the same CpG island methylation data helps illustrate these issues, not just for this particular study, but also for analogous investigations involving high-dimensional predictors with complex between-feature dependencies.

Proceedings Article•DOI•
28 May 2006
TL;DR: This paper focuses on short initial exons and presents a method to improve the detection of these short coding regions, based on the weight array method (WAM) and CpG islands.
Abstract: There are many gene prediction programs available, and while the accuracy of these programs has increased significantly over the last few years, the accurate identification of short exons remains very poor. In this paper we concentrate on short initial exons and present a method to improve the detection of these short coding regions. The algorithm is based on the weight array method (WAM) and CpG islands. The algorithm was evaluated on a total of 158 sequences containing short initial exons, and achieves an accuracy of up to 73%. By comparison with GENSCAN, the proposed WAM-CpG island algorithm shows an improvement of up to 22%.
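A sketch of the WAM component: unlike a simple weight matrix, WAM conditions each base on its predecessor, giving position-specific dinucleotide log-odds. The training sites below are toy examples, not the paper's 158-sequence set:

    import numpy as np

    BASES = "ACGT"
    IDX = {b: i for i, b in enumerate(BASES)}

    def train_wam(sites):
        L = len(sites[0])
        counts = np.ones((L - 1, 4, 4))           # +1 pseudocount
        for s in sites:
            for i in range(L - 1):
                counts[i, IDX[s[i]], IDX[s[i + 1]]] += 1
        return counts / counts.sum(axis=2, keepdims=True)

    def wam_score(P, s, background=0.25):
        """Sum of position-specific dinucleotide log-odds vs. background."""
        return sum(np.log(P[i, IDX[s[i]], IDX[s[i + 1]]] / background)
                   for i in range(len(s) - 1))

    sites = ["GCCATGG", "GCCATGA", "ACCATGG", "GACATGG"]  # Kozak-like toys
    P = train_wam(sites)
    print(wam_score(P, "GCCATGG"), wam_score(P, "TTTTTTT"))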

Proceedings Article•DOI•
28 May 2006
TL;DR: The Hemagglutinin gene in human and avian isolates of the influenza type A, subtype H5N1, virus is compared and the method works well for the study of small genomic sequences, such as in the genomes of viruses and bacteria.
Abstract: The conversion of symbolic nucleotide sequences into digital signals allows applying signal processing methods to analyze genomic data. The method works well for the study of small genomic sequences, such as in the genomes of viruses and bacteria, and is adequate for monitoring their variability and tracking the development of drug resistance. The paper is based on data downloaded from NIH GenBank, and compares the Hemagglutinin (HA) gene in human and avian isolates of the influenza type A, subtype H5N1, virus.

Proceedings Article•DOI•
Doheon Lee, Sangwoo Kim, Younghoon Kim•
10 Nov 2006
TL;DR: The whole architecture of BioCAD and its essential modules for bio-network inference and analysis are presented, along with an effective technique to elucidate network edges by integrating various information sources.
Abstract: As systems biology has begun to draw growing attention, bio-network inference and analysis have become more and more important. Though there have been many efforts toward bio-network inference, they are still far from practical application due to too many false inferences and a lack of comprehensible interpretation from a biological viewpoint. To be applicable to real problems, they should provide effective inference, reliable validation, rational elucidation, and sufficient extensibility to incorporate various relevant information sources. To address these requirements, we have been developing an information fusion software platform called BioCAD. It utilizes both local and global optimization for bio-network inference, text mining techniques for network validation and annotation, and Web services-based workflow techniques. In addition, it includes an effective technique to elucidate network edges by integrating various information sources. This paper presents the whole architecture of BioCAD and its essential modules for bio-network inference and analysis.

Proceedings Article•DOI•
28 May 2006
TL;DR: A new algorithm for gene mapping is proposed which treats the data using partial least squares regression and then locates the causal markers by cross model validation; the results agree with those obtained by standard techniques while achieving greater accuracy, demonstrating another application of multivariate data analysis to problems in human genetics.
Abstract: Identifying the causal genetic markers responsible for certain phenotypes is a main aim in human genetics. In the context of complex diseases, which are believed to have multiple causal loci of largely unknown effects and positions, it is essential to formulate general yet accurate methods for gene mapping. In this direction of research, a new algorithm for gene mapping is proposed which treats the data using partial least squares regression and then locates the causal markers by cross model validation. The results agree with those obtained by standard techniques while achieving greater accuracy, demonstrating another application of multivariate data analysis to problems in human genetics.
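A sketch of the mapping step with scikit-learn, using plain cross-validation as a stand-in for the paper's full cross model validation procedure; the marker data are simulated, with marker 25 made causal:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    X = rng.integers(0, 3, size=(300, 100)).astype(float)  # 100 markers
    y = 1.5 * X[:, 25] + rng.normal(size=300)              # causal locus

    pls = PLSRegression(n_components=3)
    print("predictive R^2:", cross_val_score(pls, X, y, cv=5).mean())
    pls.fit(X, y)
    weights = np.abs(pls.coef_).ravel()
    print("top candidate marker:", int(weights.argmax()))  # expect 25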

Proceedings Article•DOI•
28 May 2006
TL;DR: The ability of a modified biclustering technique combined with sensitivity analysis of gene expression levels to identify all potential biomarkers found by prior studies, as well as several more promising candidates that had been missed in the literature, is shown.
Abstract: The NIH/NCI estimates that one out of 57 women will develop ovarian cancer during their lifetime. Ovarian cancer is 90 percent curable when detected early. Unfortunately, many cases of ovarian cancer are not diagnosed until advanced stages because most women do not develop noticeable symptoms. This paper presents an exhaustive identification of all potential biomarkers for the diagnosis of early-stage and/or recurrent ovarian cancer using a unique and comprehensive set of gene expression data. The data set was generated by Gene Logic Inc. from ovarian normal and cancerous tissues as well as non-ovarian tissues collected at the University of Minnesota by Skubitz et al. In particular, the paper shows the ability of a modified biclustering technique combined with sensitivity analysis of gene expression levels to identify all potential biomarkers found by prior studies as well as several more promising candidates that had been missed in the literature. Furthermore, unlike most prior studies, this work screens all candidate biomarkers using two additional techniques: immunohistochemical analysis and reverse transcriptase polymerase chain reaction.

Proceedings Article•DOI•
Maria Avino•
28 May 2006
TL;DR: This paper studies finite dynamical systems with n functions acting on the same set X and probabilities assigned to these functions, and develops the concepts of homomorphism and e-homomorphism of probabilistic regulatory networks, since these concepts carry properties from one network to another.
Abstract: In this paper we study finite dynamical systems with n functions acting on the same set X and probabilities assigned to these functions, called probabilistic regulatory gene networks (PRN) in [3]. This concept coincides with, or is a natural generalization of, the concept of probabilistic Boolean networks (PBN) introduced by I. Shmulevich, E. Dougherty, and W. Zhang in [5]. In particular, the PBN model has been used to describe genetic networks and has therapeutic applications, see [6]. In a PRN the most important question is to describe the steady states of the system, so in this paper we pay attention to the idea of transforming a network into another without losing all of its properties, in particular the probability distribution. Following this objective, we develop the concepts of homomorphism and e-homomorphism of probabilistic regulatory networks, since these concepts carry properties from one network to another. Projections are special homomorphisms, and they always induce invariant subnetworks that contain cycles and steady states.
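A toy PRN makes the steady-state question concrete: the update functions and their selection probabilities induce a Markov chain on X whose stationary distribution describes the long-run behaviour. All functions and probabilities below are arbitrary examples, not from the paper:

    import numpy as np

    X = range(4)                       # state set
    f1 = {0: 1, 1: 2, 2: 2, 3: 0}      # two deterministic update functions
    f2 = {0: 0, 1: 1, 2: 3, 3: 3}
    probs = [0.7, 0.3]                 # probability of applying f1 / f2

    P = np.zeros((4, 4))               # induced Markov transition matrix
    for x in X:
        for f, c in zip((f1, f2), probs):
            P[x, f[x]] += c

    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    print("steady-state distribution:", np.round(pi, 3))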

Proceedings Article•DOI•
28 May 2006
TL;DR: An efficient method for modeling alternative secondary structures in regulatory RNAs is proposed, which can be applied to the prediction of novel regulatory RNAs in genome sequences.
Abstract: Recent research on gene regulation has revealed that many non-coding RNAs (ncRNAs) are actively involved in controlling various gene-regulatory networks. For such ncRNAs, their secondary structures play crucial roles in carrying out their functions. Interestingly enough, many regulatory RNAs can choose from two alternative structures based on external factors, which enables the RNAs to regulate the expression of certain genes in an environment-dependent manner. The existence of alternative structures gives rise to complex correlations in the primary sequence of the RNA. In this paper, we propose an efficient method for modeling alternative secondary structures in regulatory RNAs. The proposed method can be applied to the prediction of novel regulatory RNAs in genome sequences.

Proceedings Article•DOI•
10 Nov 2006
TL;DR: The experimental results show the approach is superior to traditional approaches, including Bisecting K-means, a leading document clustering approach, in terms of cluster quality and clustering reliability, and that it provides a concise but rich text summary in key concepts and sentences.
Abstract: We introduce a method that integrates biomedical literature clustering and summarization using biomedical ontology. The core of the approach is to identify document cluster models as semantic chunks capturing the core semantic relationships in the ontology-enriched scale-free graphical representation of documents. These document cluster models are used both for document clustering on document assignment and for text summarization on the construction of a Text Semantic Interaction Network (TSIN). Our experimental results show our approach is superior to traditional approaches, including Bisecting K-means, a leading document clustering approach, in terms of cluster quality and clustering reliability. In addition, our approach provides a concise but rich text summary in key concepts and sentences.

Proceedings Article•DOI•
28 May 2006
TL;DR: All the biclusters discovered with the proposed methodology have no imperfections, and the complexity of the algorithm is shown to be lower than that of previous approaches.
Abstract: In this paper, we describe an approach for finding all order-preserving gene biclusters from a set of DNA microarray experimental data that combines the algorithm for finding biclusters with constant values on columns, developed in one of our previous studies, with an adaptive gene expression level quantization procedure. All the biclusters discovered with the proposed methodology have no imperfections, and the complexity of the algorithm is shown to be lower than that of previous approaches. Application of the method to ovarian cancer data appears to reveal significant local patterns.
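The order-preserving property itself is easy to state in code: a bicluster qualifies when every selected row induces the same ordering of the selected columns. A sketch of that check (the search and quantization procedure of the paper is omitted):

    import numpy as np

    def is_order_preserving(M, rows, cols):
        """True iff all chosen rows rank the chosen columns identically."""
        sub = M[np.ix_(rows, cols)]
        orders = np.argsort(sub, axis=1)
        return bool((orders == orders[0]).all())

    M = np.array([[1.0, 3.0, 2.0, 9.0],
                  [0.2, 0.9, 0.5, 0.1],
                  [5.0, 8.0, 6.0, 2.0]])
    print(is_order_preserving(M, [0, 1, 2], [0, 2, 1]))  # True: same ordering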

Proceedings Article•DOI•
28 May 2006
TL;DR: An approach to subsequence identification based on 'purity functions' derived from state transition tables, to be used in conjunction with a method for the identification of predictor genes and functions is proposed.
Abstract: This paper presents a new method of fitting probabilistic Boolean networks (PBNs) to time-course state data. The critical issue to be addressed is to identify the contributions of the PBN's constituent Boolean networks in a sequence of temporal data. The sequence must be partitioned into sections, each corresponding to a single model with fixed parameters. We propose an approach to subsequence identification based on 'purity functions' derived from state transition tables, to be used in conjunction with a method for the identification of predictor genes and functions. We also present the estimation of the network switching probability, selection probabilities, perturbation rate, as well as observations on the inference of input genes, predictor functions and their relation with the length of the observed data sequence.

Proceedings Article•
01 Jan 2006
TL;DR: A DNA implementation of an arbitrary finite state machine is developed that determines whether the cell has a specific disease based on the presence or absence of indicator mRNA molecules and releases a proper drug for treatment.
Abstract: We propose a technique to diagnose and treat individual cells in the human body. A virus-like system delivers a copy of a diagnosis and treatment DNA complex to each cell. The complex determines whether the cell has a specific disease based on the presence or absence of indicator mRNA molecules and, if the diagnosis is positive, releases a proper drug for treatment. As a tool for the diagnosis and treatment system, we develop a DNA implementation of an arbitrary finite state machine.
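A toy software analogue of the diagnostic automaton: states advance only when the expected indicator mRNAs are observed, and reaching the accepting state triggers drug release. All marker names below are invented for illustration:

    def diagnose(present_mrnas, disease_signature):
        """Accept iff every indicator in the signature appears in order."""
        state = 0
        for mrna in present_mrnas:
            if state < len(disease_signature) and mrna == disease_signature[state]:
                state += 1                       # transition to next state
        return state == len(disease_signature)   # accepting state reached?

    signature = ["mRNA-A", "mRNA-B", "mRNA-C"]   # hypothetical disease markers
    cell = ["mRNA-X", "mRNA-A", "mRNA-B", "mRNA-C"]
    if diagnose(cell, signature):
        print("diagnosis positive: release drug")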

Proceedings Article•DOI•
28 May 2006
TL;DR: The result is an encoder which has excellent compression efficiency on annotated genome sequences and provides instantaneous access to functional elements in the file, thus serving as a basis for further applications, such as indexing and searching for specified feature entries.
Abstract: This article investigates the efficiency of randomly accessible coding for annotated genome files and compares it to universal coding. The result is an encoder which has excellent compression efficiency on annotated genome sequences and provides instantaneous access to functional elements in the file; it thus serves as a basis for further applications, such as indexing and searching for specified feature entries.