
Showing papers presented at "International Conference on Bioinformatics in 2006"


Journal Article•DOI•
18 Dec 2006
TL;DR: No individual method had a sufficient level of sensitivity across both evaluation sets to enable reliable application to hypothetical proteins; all methods showed lower performance on the LOCATE dataset, with variable performance on individual subcellular localizations.
Abstract: Background: Determination of the subcellular location of a protein is essential to understanding its biochemical function. This information can provide insight into the function of hypothetical or novel proteins. These data are difficult to obtain experimentally but have become especially important since many whole genome sequencing projects have been completed and many resulting protein sequences still lack detailed functional information. To address this paucity of data, many computational prediction methods have been developed. However, these methods have varying levels of accuracy and perform differently based on the sequences that are presented to the underlying algorithm. It is therefore useful to compare these methods and monitor their performance. Results: In order to perform a comprehensive survey of prediction methods, we selected only methods that accepted large batches of protein sequences, were publicly available, and were able to predict localization to at least nine of the major subcellular locations (nucleus, cytosol, mitochondrion, extracellular region, plasma membrane, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, and lysosome). The selected methods were CELLO, MultiLoc, Proteome Analyst, pTarget and WoLF PSORT. These methods were evaluated using 3763 mouse proteins from SwissProt that represent the source of the training sets used in development of the individual methods. In addition, an independent evaluation set of 2145 mouse proteins from LOCATE, biased towards subcellular localizations underrepresented in SwissProt, was used. The sensitivity and specificity were calculated for each method and compared to a theoretical value based on what might be observed by random chance. Conclusion: No individual method had a sufficient level of sensitivity across both evaluation sets to enable reliable application to hypothetical proteins. All methods showed lower performance on the LOCATE dataset, and variable performance on individual subcellular localizations was observed. Proteins localized to the secretory pathway were the most difficult to predict, while nuclear and extracellular proteins were predicted with the highest sensitivity.
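The per-localization evaluation described above reduces to standard confusion-matrix arithmetic. A minimal sketch, assuming hypothetical label lists in place of the SwissProt/LOCATE annotations:

    from collections import Counter

    LOCATIONS = ["nucleus", "cytosol", "mitochondrion", "extracellular region",
                 "plasma membrane", "Golgi apparatus", "ER", "peroxisome",
                 "lysosome"]

    def per_class_metrics(true_locs, pred_locs):
        """Return {location: (sensitivity, specificity)} for one predictor."""
        n = len(true_locs)
        metrics = {}
        for loc in LOCATIONS:
            tp = sum(t == loc and p == loc for t, p in zip(true_locs, pred_locs))
            fn = sum(t == loc and p != loc for t, p in zip(true_locs, pred_locs))
            tn = sum(t != loc and p != loc for t, p in zip(true_locs, pred_locs))
            fp = n - tp - fn - tn
            sens = tp / (tp + fn) if tp + fn else float("nan")
            spec = tn / (tn + fp) if tn + fp else float("nan")
            metrics[loc] = (sens, spec)
        return metrics

    def random_chance_sensitivity(pred_locs):
        """Expected sensitivity under random chance: the fraction of
        predictions a method allocates to each class."""
        freq = Counter(pred_locs)
        return {loc: freq[loc] / len(pred_locs) for loc in LOCATIONS}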

67 citations


Proceedings Article•DOI•
10 Nov 2006
TL;DR: This paper describes how natural language processing and text mining techniques were implemented on the transcribed verbal descriptions from retinal experts of biomedical disease features to generate feature-attribute pairs that were incorporated within a user interface for a collaborative ontology development tool.
Abstract: The use of text mining and natural language processing can extend into the realm of knowledge acquisition and management for biomedical applications. In this paper, we describe how we implemented natural language processing and text mining techniques on the transcribed verbal descriptions from retinal experts of biomedical disease features. The feature-attribute pairs generated were then incorporated within a user interface for a collaborative ontology development tool. This tool, IDOCS, is being used in the biomedical domain to help retinal specialists reach a consensus on a common ontology for describing age-related macular degeneration (AMD). We compare the use of traditional text mining and natural language processing techniques with that of a retinal specialist's analysis and discuss how we might integrate these techniques for future biomedical ontology and user interface development.

21 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A study on simulated data produced by a mathematical model of cell cycle control in budding yeast confirmed the robustness of the linear model and its suitability for a first level, genome-wide analysis of high throughput dynamic data.
Abstract: Dynamic Bayesian networks offer a powerful modeling tool to unravel cellular mechanisms. In particular, Linear Gaussian Networks allow researchers to avoid information loss associated with discretization and render the learning process computationally tractable even for hundreds of variables. Yet, are linear models suitable to learn the complex dynamic interactions among genes and proteins? We here present a study on simulated data produced by a mathematical model of cell cycle control in budding yeast: the results obtained confirmed the robustness of the linear model and its suitability for a first level, genome-wide analysis of high throughput dynamic data.
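Because the model is linear Gaussian, each gene's conditional distribution given the previous time slice is an ordinary linear regression, which is what keeps learning tractable for hundreds of variables. A minimal sketch on synthetic data standing in for the budding-yeast simulations:

    import numpy as np

    rng = np.random.default_rng(0)
    T, G = 2000, 10                                # time points, genes (toy)
    A_true = rng.normal(0, 1, (G, G)) / (2 * np.sqrt(G))  # stable dynamics
    X = np.zeros((T, G))
    X[0] = rng.normal(size=G)
    for t in range(T - 1):
        X[t + 1] = X[t] @ A_true.T + rng.normal(0, 0.1, G)

    # Least-squares fit of X[t+1] ~ A X[t]; A_hat encodes the network.
    B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    A_hat = B.T
    print("max absolute error in recovered coefficients:",
          np.abs(A_hat - A_true).max())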

18 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A novel DSP model is proposed that clearly explains the intricate operation of the DNA spectrum, allows the derivation of new DNA spectrum expressions which, in turn, generalize and unify previous work and suggests an efficient way to improve the detection of protein coding regions by computing a filtered spectrum.
Abstract: Many signal processing techniques have been introduced in the past to identify the protein coding regions by detecting the so-called period-3 component in the DNA spectrum. However, a solid understanding of this observed phenomenon and its underlying mechanism from a DSP perspective has been missing from the literature. We therefore propose a novel DSP model that i) clearly explains the intricate operation of the DNA spectrum, ii) allows the derivation of new DNA spectrum expressions which, in turn, generalize and unify previous work and iii) suggests an efficient way to improve the detection of protein coding regions by computing a filtered spectrum.
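The period-3 component referred to above is conventionally measured by mapping the sequence to four binary indicator signals and summing their DFT power at frequency k = N/3. A minimal sketch of that standard computation (not of the authors' new model):

    import numpy as np

    def period3_power(seq):
        """Ratio of DFT power at k = N/3 to the average power (choosing N a
        multiple of 3 keeps the period-3 frequency on an exact DFT bin)."""
        N = len(seq)
        total = np.zeros(N)
        for base in "ACGT":
            u = np.array([1.0 if c == base else 0.0 for c in seq])
            total += np.abs(np.fft.fft(u)) ** 2
        return total[N // 3] / total[1:N // 2].mean()

    rng = np.random.default_rng(0)
    coding_like = "ATGGCC" * 50                        # strong codon bias
    random_seq = "".join(rng.choice(list("ACGT"), 300))
    print(period3_power(coding_like), period3_power(random_seq))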

17 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: Simulation results using synthetic data and microarray measurements show the effectiveness of the proposed scheme, where genetic programming is applied to identify the structure of the model and Kalman filtering is employed to estimate the parameters in each iteration.
Abstract: In this paper, gene regulatory networks are inferred through evolutionary modeling and time-series microarray measurements. A nonlinear differential equation model is adopted and an iterative algorithm is proposed to identify the model, where genetic programming is applied to identify the structure of the model and Kalman filtering is employed to estimate the parameters in each iteration. Simulation results using synthetic data and microarray measurements show the effectiveness of the proposed scheme.
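A highly simplified sketch of the identify-then-estimate iteration: a handful of fixed candidate structures stands in for the genetic-programming search, and ordinary least squares stands in for the Kalman filter. All data and structure names are illustrative, not from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 200)
    x = np.exp(-0.5 * t) + 0.002 * rng.normal(size=t.size)  # toy expression
    dxdt = np.gradient(x, t)

    # Candidate structures dx/dt = a*f(x); GP would evolve such expressions.
    candidates = {"linear": x, "quadratic": x**2, "sqrt": np.sqrt(np.abs(x))}

    best = None
    for name, f in candidates.items():
        a = (f @ dxdt) / (f @ f)              # least-squares parameter
        err = np.mean((dxdt - a * f) ** 2)    # fit of this structure
        if best is None or err < best[2]:
            best = (name, a, err)
    print("selected structure:", best[0], "with parameter", round(best[1], 3))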

16 citations


Proceedings Article•DOI•
10 Nov 2006
TL;DR: Correlate summation analysis (CSA) provides an alternative method to yield important variables and data clustering which appear to be comparable to PCA and may indicate an underlying mechanism affecting two variables.
Abstract: Principal Component Analysis (PCA) can identify key variables among complex data sets, but requires special statistical packages. Alternatively, Discrete and Aggregate Correlate Summation (DCΣ/ACΣ) identifies the most covariant variables relative to mean clustering for grouped and individual data, respectively. We compared these analyses regarding the influence of estrogen and salt on the hypertensive phenotype of the female mRen2.Lewis strain. DCΣx compares changes in correlation between two groups for each variable versus all of the others, relative to mean shift. ACΣx determines which variable has the highest total correlation to the other variables, relative to its mean-reduced (normalized) standard deviation (nSD). To compare correlate summation to PCA, the absolute weights of the first principal component (EVEC1x) were multiplied by the nSD as in ACΣx. Nine variables, including proteinuria, serum ACE, plasma Ang II, renin, heart weight to body weight ratio, and systolic blood pressure, were analyzed with respect to normal and high salt diets, as well as to estrogen intact and depleted conditions. DCΣx results for both arms of the study were significantly correlated with EVEC1x (r=0.72, p
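A sketch of one plausible reading of the ACΣ computation: each variable's summed absolute correlation with the others, weighted by its normalized standard deviation (nSD). The data are random placeholders, not the mRen2.Lewis measurements, and the exact weighting in the paper may differ:

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.normal(10, 2, size=(40, 9))         # 40 animals x 9 variables

    R = np.corrcoef(data, rowvar=False)
    total_corr = np.abs(R).sum(axis=0) - 1         # drop self-correlation
    nsd = data.std(axis=0) / data.mean(axis=0)     # normalized SD
    acs = total_corr * nsd                         # higher = more covariant
    print("most covariant variable index:", int(acs.argmax()))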

14 citations


Proceedings Article•DOI•
Slobodan Vucetic•
10 Nov 2006
TL;DR: An algorithm is proposed that omits stemming and, instead, uses the most discriminative substrings as attributes in classification, which is particularly useful when labeled datasets are small.
Abstract: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Due to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we proposed [1] an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes. The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (with Area Under the ROC Curve (AUC) accuracy in the range 0.92-0.97) when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
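The substring-attribute idea can be approximated with character n-grams plus a discriminative filter. A sketch using scikit-learn, where chi-squared selection is an illustrative stand-in for the paper's exact criterion and the documents and labels are toys, not the iProLINK abstracts:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["protein was phosphorylated at serine residues",
            "acetylation of lysine was observed",
            "no modification detected in the assay",
            "phosphorylation site mapped by mass spectrometry"]
    labels = [1, 0, 0, 1]    # 1 = phosphorylation evidence (toy labels)

    clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(4, 6)),  # substrings
        SelectKBest(chi2, k=20),               # keep most discriminative
        MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["phosphorylated residue detected"]))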

12 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: Experimental results of Terahertz spectroscopy of several different DNA samples show that the EMD aids the clustering process and yields clustering of higher validity than that obtained from the raw data.
Abstract: DNA sequence analysis has been widely studied by gene-expression microarray techniques. Few results, however, have been provided by Terahertz spectroscopy, which reveals the absorption or reflectance percentage from different DNA sequences. Previous Terahertz methods have lacked a quantitative analysis of the spectroscopy features, and no definitive conclusion regarding the data can be easily drawn. In this paper, we use a signal processing approach which gives a quantitative interpretation of the DNA spectroscopy. Due to the presence of physical noise, the data can be contaminated by both random fluctuations and impulsive noise. A new signal processing tool called empirical mode decomposition (EMD) is employed to remove the noise and extract the trend of the signal. The data is subsequently partitioned by clustering methods. Experimental results of Terahertz spectroscopy of several different DNA samples show that the EMD aids the clustering process and yields clustering of higher validity than that obtained from the raw data.
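A sketch of the denoise-then-cluster pipeline, assuming the third-party PyEMD package (pip install EMD-signal): the fastest intrinsic mode functions are treated as noise and discarded before clustering. The spectra here are synthetic stand-ins for the Terahertz measurements:

    import numpy as np
    from PyEMD import EMD                         # pip install EMD-signal
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 500)
    spectra = [np.sin(3 * (g + 1) * np.pi * x) + 0.3 * rng.normal(size=x.size)
               for g in range(2) for _ in range(5)]  # two underlying groups

    trends = []
    for s in spectra:
        imfs = EMD()(s)                           # IMFs, slowest (trend) last
        keep = imfs[2:] if len(imfs) > 2 else imfs[-1:]
        trends.append(keep.sum(axis=0))           # drop the two fastest IMFs

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trends)
    print(labels)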

11 citations


Proceedings Article•
01 Mar 2006

11 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: The results indicate that the proposed method is effective in identifying DNA copy number changes from the microarray comparative genomic hybridization (aCGH) profile.
Abstract: Cancer development is usually associated with DNA copy number changes in the genome. DNA copy number changes correspond to chromosomal aberrations and signify abnormality of a cell. Therefore, identifying statistically significant DNA copy number changes is evidently crucial in cancer research, clinical diagnostic applications, and other related genomic research. The problem can be formulated using statistical change point theory. We propose to use the mean and variance change point model to study the DNA copy number changes from the microarray comparative genomic hybridization (aCGH) profile. The approximate p-value of identifying a change point is derived using the Schwarz information criterion (SIC). The proposed method has been validated by Monte Carlo simulation and applications to aCGH profiles from several cell lines (fibroblast cancer cell line, breast tumor cell line, and breast cancer cell line). The results indicate that the proposed method is effective in identifying DNA copy number changes.
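A sketch of the mean-and-variance change point test on simulated log-ratios, using a common form of the SIC (the paper's exact penalty terms may differ): the criterion under "no change" is compared with the minimum over candidate change positions k.

    import numpy as np

    def sic_no_change(x):
        n = len(x)
        return n * np.log(2 * np.pi * x.var()) + n + 2 * np.log(n)

    def sic_change_at(x, k):
        n = len(x)
        v1, v2 = x[:k].var(), x[k:].var()
        return (k * np.log(2 * np.pi * v1) + (n - k) * np.log(2 * np.pi * v2)
                + n + 4 * np.log(n))

    rng = np.random.default_rng(4)
    x = np.concatenate([rng.normal(0.0, 0.1, 60),    # normal copy number
                        rng.normal(0.6, 0.25, 40)])  # gain, higher noise
    ks = range(5, len(x) - 5)
    sics = [sic_change_at(x, k) for k in ks]
    k_best = list(ks)[int(np.argmin(sics))]
    if min(sics) < sic_no_change(x):
        print("change point detected near probe", k_best)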

10 citations


Proceedings Article•DOI•
28 May 2006
TL;DR: A novel approach that combines fuzzy clustering with multiscale feature selection to improve the accuracy of classifying M-FISH images is introduced and will improve the reliability of the M-FISH imaging technique in identifying subtle and cryptic genetic aberrations for cancer diagnosis and genetic research.
Abstract: Multi-color or multiplex fluorescence in situ hybridization (M-FISH) imaging is a recently developed molecular cytogenetic diagnosis technique for rapid visualization of genomic aberrations at the chromosomal level. The reliability of the technique depends primarily on accurate pixel-wise classification. In this paper we introduce a novel approach that combines fuzzy clustering with multiscale feature selection to improve the accuracy of classifying M-FISH images. A multiscale principal component analysis (MPCA) is proposed to reduce the redundancy between multi-channel images. In comparison with conventional PCA, it offers adaptive redundancy reduction. The algorithms have been tested on an M-FISH image database, demonstrating improved classification accuracy. The increased accuracy of pixel-wise classification will improve the reliability of the M-FISH imaging technique in identifying subtle and cryptic genetic aberrations for cancer diagnosis and genetic research.
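The clustering half of the approach can be illustrated with a compact fuzzy c-means routine (the multiscale PCA step is omitted; the pixels below are random multi-channel vectors standing in for M-FISH spectra):

    import numpy as np

    def fuzzy_cmeans(X, c, m=2.0, iters=100):
        """X: (n_pixels, n_channels). Returns membership matrix U (n, c)."""
        rng = np.random.default_rng(0)
        centers = X[rng.choice(len(X), c, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
            U = 1.0 / (d ** (2 / (m - 1)))
            U /= U.sum(axis=1, keepdims=True)      # normalize memberships
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
        return U

    X = np.random.default_rng(5).normal(size=(1000, 5))  # toy 5-channel pixels
    U = fuzzy_cmeans(X, c=4)
    labels = U.argmax(axis=1)                      # hard pixel classes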

Proceedings Article•DOI•
10 Nov 2006
TL;DR: This research explores thresholding of SVM scores, the relationship of performance to hierarchy level and to the number of positives in the training sets, and finds that hierarchy level is important especially for the molecular function and biological process hierarchies.
Abstract: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach for doing this is to use the information available about these genes in the biomedical literature. Our goal, based on this approach, is to develop automatic methods for annotation that could supplement the expensive manual annotation processes currently in place. Using a set of Support Vector Machine (SVM) classifiers we were able to achieve F-scores of 0.48, 0.4 and 0.32 for codes of the molecular function, cellular component and biological process GO hierarchies, respectively. We explore thresholding of SVM scores, the relationship of performance to hierarchy level and to the number of positives in the training sets. We find that hierarchy level is important, especially for the molecular function and biological process hierarchies. We find that the cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research also exploits the hierarchical structures by defining and testing a relaxed criterion for classification correctness.
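A sketch of the score-thresholding step on synthetic data: rather than the default SVM cutoff of zero, the decision threshold is tuned on held-out examples to maximize the F-score for a given GO code. The classifier choice and data here are assumptions for illustration:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)
    X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, random_state=0)

    svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
    scores = svm.decision_function(X_dev)
    best_t = max(np.unique(scores),
                 key=lambda t: f1_score(y_dev, scores >= t))
    print("tuned threshold:", round(float(best_t), 3),
          "F-score:", round(f1_score(y_dev, scores >= best_t), 3))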

Proceedings Article•DOI•
28 May 2006
TL;DR: A modeling approach based on Probabilistic Boolean Networks for the inference of genetic regulatory networks from gene expression time-course data in different biological conditions, i.e., making use of the information contained in sets of genes and the interactions between genes rather than single-gene analyses.
Abstract: We propose a modeling approach based on Probabilistic Boolean Networks for the inference of genetic regulatory networks from gene expression time-course data in different biological conditions, i.e., making use of the information contained in sets of genes and the interactions between genes rather than single-gene analyses. This model is a collection of traditional Probabilistic Boolean Networks. We also present an approach based on constrained prediction and the Coefficient of Determination (COD) for the identification of the model from gene expression data. The modeling approach is applied in the context of pathway biology to the analysis of gene interaction networks.

Proceedings Article•DOI•
10 Nov 2006
TL;DR: A way to incorporate a priori knowledge of gene relationships into LSI/SVD and NMF and a gene retrieval method based on NMF (GR/NMF), which shows comparable performance with latent semantic indexing based on SVD.
Abstract: The construction of literature-based networks of gene-gene interactions is one of the most important applications of text mining in bioinformatics. Extracting potential gene relationships from the biomedical literature may be helpful in building biological hypotheses that can be explored further experimentally. In this paper, we explore the utility of singular value decomposition (SVD) and non-negative matrix factorization (NMF) to extract unrecognized gene relationships from the biomedical literature by taking advantage of known gene relationships. We introduce a way to incorporate a priori knowledge of gene relationships into LSI/SVD and NMF. In addition, we propose a gene retrieval method based on NMF (GR/NMF), which shows comparable performance with latent semantic indexing based on SVD.
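A sketch of NMF-based gene retrieval on a random gene-by-document matrix (the real matrix would be built from the literature, and the paper's incorporation of a priori knowledge is omitted here): genes are ranked by cosine similarity of their factor representations.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(6)
    A = rng.random((50, 200))              # 50 genes x 200 documents (toy)

    W = NMF(n_components=10, init="nndsvda", max_iter=500,
            random_state=0).fit_transform(A)   # one factor row per gene

    query_gene = 0
    sims = cosine_similarity(W[query_gene:query_gene + 1], W).ravel()
    related = np.argsort(-sims)[1:6]           # top 5 related genes
    print("genes most related to gene 0:", related)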

Proceedings Article•DOI•
10 Nov 2006
TL;DR: It is found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on such symbols than the other two systems, possibly due to fine-tuning of the system for those symbols.
Abstract: With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from to build literature mining applications. In this study, we focus on one specific kind of task, i.e., detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of documents recently published in the biomedical domain, ii) what the coverage is for various knowledge bases in including acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems, namely, ALICE (a handcrafted pattern/rule based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), in detecting definitions for acronyms/abbreviations/symbols, as well as the conceptual coverage of existing thesauri, namely, the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly due to the fact that most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precisions and recalls of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols, where ALICE performs relatively better than the other two, possibly due to fine-tuning of the system for those symbols. We also found existing knowledge bases have a good coverage of definitions for those frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
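The Schwartz and Hearst program mentioned above rests on a short right-to-left alignment between an abbreviation and its candidate definition. A minimal re-implementation of that core matching step (not of the full surrounding pipeline):

    def find_long_form(short, candidate):
        """Return the matched definition inside `candidate`, or None."""
        s, l = len(short) - 1, len(candidate) - 1
        while s >= 0:
            c = short[s].lower()
            if not c.isalnum():
                s -= 1
                continue
            # Scan left for a matching character; the short form's first
            # character must additionally sit at the start of a word.
            while l >= 0 and (candidate[l].lower() != c or
                              (s == 0 and l > 0 and candidate[l - 1].isalnum())):
                l -= 1
            if l < 0:
                return None
            s -= 1
            l -= 1
        return candidate[l + 1:]

    print(find_long_form("AMD", "age-related macular degeneration"))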

Proceedings Article•
01 Jan 2006
TL;DR: It is common that certain incorrect trees can have likelihood values at least as large as that of the correct tree, suggesting that even if the authors are able to find a truly globally optimal tree under the maximum likelihood criterion, this tree may not necessarily be the correct phylogenetic tree.
Abstract: Recently we developed a new quartet-based algorithm for phylogenetic analysis [22]. This algorithm constructs a limited number of trees for a given set of DNA or protein sequences, and the initial experimental results show that the probability for the correct tree to be included in this small set of trees is very high. In this paper we further extend the idea. We first discuss a revision to the original algorithm to reduce the number of trees generated, while keeping the high probability for the correct tree to be included. We then deal with the issue of how to retrieve the correct tree from the generated trees; our current approach is to calculate the likelihood values of these trees and pick a few of the best ones, which have the highest likelihood values. Though the experimental results are comparable to those obtained from currently popular ML-based algorithms, we find that it is common that certain incorrect trees can have likelihood values at least as large as that of the correct tree. A significant implication of this is that even if we are able to find a truly globally optimal tree under the maximum likelihood criterion, this tree may not necessarily be the correct phylogenetic tree!

Proceedings Article•DOI•
10 Nov 2006
TL;DR: These efforts towards defining Bmp2 gene expression determinants are described by combining functional assays with computational analyses of emerging genome data and suggest that both primary sequence and more subtle parameters such as nucleotide composition control BMP2 expression at the post-transcriptional level.
Abstract: Comparing genomes from diverse species has revealed surprisingly high conservation between proteins of vastly different organisms, e.g. 75% of pufferfish proteins have human counterparts. Thus subtle variation in the expression of master developmental control genes, like Bone Morphogenetic Protein (Bmp)2, is central to species differentiation. Understanding the evolution of the complex transcriptional and post-transcriptional mechanisms required to precisely regulate such genes requires novel, interdisciplinary approaches. We describe here our efforts towards defining Bmp2 gene expression determinants by combining functional assays with computational analyses of emerging genome data. Our results suggest that both primary sequence and more subtle parameters such as nucleotide composition control Bmp2 expression at the post-transcriptional level.

Journal Article•DOI•
01 Dec 2006
TL;DR: Concerns with the validation steps are showcased that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods in investigations involving high-dimensional predictors with complex between-feature dependencies.
Abstract: In a recent article in PLoS Genetics, Bock et al. (2006) undertake an extensive computational epigenetics analysis of the ability of DNA sequence-derived features, capturing attributes such as tetramer frequencies, repeats and predicted structure, to predict the methylation status of CpG islands. Their suite of analyses appears highly rigorous with regard to accompanying validation procedures, employing stringent Bonferroni corrections, stratified cross-validation, and follow-up experimental verification. Here, however, we showcase concerns with the validation steps, in part ascribable to the genome scale of the investigation, that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods. A series of new analyses of the same CpG island methylation data helps illustrate these issues, not just for this particular study, but also for analogous investigations involving high-dimensional predictors with complex between-feature dependencies.

Proceedings Article•DOI•
28 May 2006
TL;DR: This paper focuses on short initial exons and presents a method to improve the detection of these short coding regions, based on the weight array method (WAM) and CpG islands.
Abstract: There are many gene prediction programs available, and while the accuracy of these programs has increased significantly over the last few years, the accurate identification of short exons remains very poor. In this paper we concentrate on short initial exons and present a method to improve the detection of these short coding regions. The algorithm is based on the weight array method (WAM) and CpG islands. The algorithm was evaluated on a total of 158 sequences containing short initial exons, and achieves an accuracy of up to 73%. By comparison with GENSCAN, the proposed WAM-CpG island algorithm shows an improvement of up to 22%.
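A sketch of the WAM component: unlike a simple weight matrix, WAM conditions each base on its predecessor, giving position-specific dinucleotide log-odds. The training sites below are toy examples, not the paper's 158-sequence set:

    import numpy as np

    BASES = "ACGT"
    IDX = {b: i for i, b in enumerate(BASES)}

    def train_wam(sites):
        L = len(sites[0])
        counts = np.ones((L - 1, 4, 4))           # +1 pseudocount
        for s in sites:
            for i in range(L - 1):
                counts[i, IDX[s[i]], IDX[s[i + 1]]] += 1
        return counts / counts.sum(axis=2, keepdims=True)

    def wam_score(P, s, background=0.25):
        """Sum of position-specific dinucleotide log-odds vs. background."""
        return sum(np.log(P[i, IDX[s[i]], IDX[s[i + 1]]] / background)
                   for i in range(len(s) - 1))

    sites = ["GCCATGG", "GCCATGA", "ACCATGG", "GACATGG"]  # Kozak-like toys
    P = train_wam(sites)
    print(wam_score(P, "GCCATGG"), wam_score(P, "TTTTTTT"))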

Proceedings Article•DOI•
28 May 2006
TL;DR: The Hemagglutinin gene in human and avian isolates of the influenza type A, subtype H5N1, virus is compared and the method works well for the study of small genomic sequences, such as in the genomes of viruses and bacteria.
Abstract: The conversion of symbolic nucleotide sequences into digital signals allows applying signal processing methods to analyze genomic data. The method works well for the study of small genomic sequences, such as in the genomes of viruses and bacteria, and is adequate for monitoring their variability and tracking the development of drug resistance. The paper is based on data downloaded from NIH GenBank, and compares the Hemagglutinin (HA) gene in human and avian isolates of the influenza type A, subtype H5N1, virus.

Proceedings Article•DOI•
Doheon Lee, Sangwoo Kim, Younghoon Kim•
10 Nov 2006
TL;DR: The whole architecture of BioCAD and its essential modules for bio-network inference and analysis are presented, along with an effective technique to elucidate network edges by integrating various information sources.
Abstract: As systems biology has begun to draw growing attention, bio-network inference and analysis have become more and more important. Though there have been many efforts toward bio-network inference, they are still far from practical application due to too many false inferences and a lack of comprehensible interpretation from a biological viewpoint. To be applicable to real problems, they should provide effective inference, reliable validation, rational elucidation, and sufficient extensibility to incorporate various relevant information sources. To address these requirements, we have been developing an information fusion software platform called BioCAD. It utilizes both local and global optimization for bio-network inference, text mining techniques for network validation and annotation, and Web services-based workflow techniques. In addition, it includes an effective technique to elucidate network edges by integrating various information sources. This paper presents the whole architecture of BioCAD and its essential modules for bio-network inference and analysis.

Proceedings Article•DOI•
28 May 2006
TL;DR: A new algorithm for gene mapping is proposed which treats the data using partial least squares regression and then locates the causal markers by cross model validation; the results agree with those obtained by standard techniques while achieving greater accuracy, demonstrating another application of multivariate data analysis to problems in human genetics.
Abstract: Identifying the causal genetic markers responsible for certain phenotypes is a main aim in human genetics. In the context of complex diseases, which are believed to have multiple causal loci of largely unknown effects and positions, it is essential to formulate general yet accurate methods for gene mapping. In this direction of research, a new algorithm for gene mapping is proposed which treats the data using partial least squares regression and then locates the causal markers by cross model validation. The results agree with those obtained by standard techniques while achieving greater accuracy, demonstrating another application of multivariate data analysis to problems in human genetics.
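A sketch of the mapping step with scikit-learn, using plain cross-validation as a stand-in for the paper's full cross model validation procedure; the marker data are simulated, with marker 25 made causal:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    X = rng.integers(0, 3, size=(300, 100)).astype(float)  # 100 markers
    y = 1.5 * X[:, 25] + rng.normal(size=300)              # causal locus

    pls = PLSRegression(n_components=3)
    print("predictive R^2:", cross_val_score(pls, X, y, cv=5).mean())
    pls.fit(X, y)
    weights = np.abs(pls.coef_).ravel()
    print("top candidate marker:", int(weights.argmax()))  # expect 25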

Proceedings Article•DOI•
28 May 2006
TL;DR: The ability of a modified biclustering technique combined with sensitivity analysis of gene expression levels to identify all potential biomarkers found by prior studies, as well as several more promising candidates that had been missed in the literature, is shown.
Abstract: The NIH/NCI estimates that one out of 57 women will develop ovarian cancer during their lifetime. Ovarian cancer is 90 percent curable when detected early. Unfortunately, many cases of ovarian cancer are not diagnosed until advanced stages because most women do not develop noticeable symptoms. This paper presents an exhaustive identification of all potential biomarkers for the diagnosis of early-stage and/or recurrent ovarian cancer using a unique and comprehensive set of gene expression data. The data set was generated by Gene Logic Inc. from ovarian normal and cancerous tissues as well as non-ovarian tissues collected at the University of Minnesota by Skubitz et al. In particular, the paper shows the ability of a modified biclustering technique combined with sensitivity analysis of gene expression levels to identify all potential biomarkers found by prior studies as well as several more promising candidates that had been missed in the literature. Furthermore, unlike most prior studies, this work screens all candidate biomarkers using two additional techniques: immunohistochemical analysis and reverse transcriptase polymerase chain reaction.

Proceedings Article•DOI•
Maria Avino•
28 May 2006
TL;DR: This paper studies finite dynamical systems with n functions acting on the same set X and probabilities assigned to these functions, and develops the concepts of homomorphism and e-homomorphism of probabilistic regulatory networks, since these concepts carry properties from one network to another.
Abstract: In this paper we study finite dynamical systems with n functions acting on the same set X and probabilities assigned to these functions, called probabilistic regulatory gene networks (PRN) in [3]. This concept coincides with, or is a natural generalization of, the concept of probabilistic Boolean networks (PBN) introduced by I. Shmulevich, E. Dougherty, and W. Zhang in [5]. In particular, the PBN model has been used to describe genetic networks and has therapeutic applications, see [6]. In a PRN the most important question is to describe the steady states of the system, so in this paper we pay attention to the idea of transforming a network into another without losing all of its properties, in particular the probability distribution. Following this objective, we develop the concepts of homomorphism and e-homomorphism of probabilistic regulatory networks, since these concepts carry properties from one network to another. Projections are special homomorphisms, and they always induce invariant subnetworks that contain cycles and steady states.
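A toy PRN makes the steady-state question concrete: the update functions and their selection probabilities induce a Markov chain on X whose stationary distribution describes the long-run behaviour. All functions and probabilities below are arbitrary examples, not from the paper:

    import numpy as np

    X = range(4)                       # state set
    f1 = {0: 1, 1: 2, 2: 2, 3: 0}      # two deterministic update functions
    f2 = {0: 0, 1: 1, 2: 3, 3: 3}
    probs = [0.7, 0.3]                 # probability of applying f1 / f2

    P = np.zeros((4, 4))               # induced Markov transition matrix
    for x in X:
        for f, c in zip((f1, f2), probs):
            P[x, f[x]] += c

    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    print("steady-state distribution:", np.round(pi, 3))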

Proceedings Article•DOI•
28 May 2006
TL;DR: An efficient method for modeling alternative secondary structures in regulatory RNAs is proposed, which can be applied to the prediction of novel regulatory RNAs in genome sequences.
Abstract: Recent research on gene regulation has revealed that many non-coding RNAs (ncRNAs) are actively involved in controlling various gene-regulatory networks. For such ncRNAs, their secondary structures play crucial roles in carrying out their functions. Interestingly enough, many regulatory RNAs can choose from two alternative structures based on external factors, which enables the RNAs to regulate the expression of certain genes in an environment-dependent manner. The existence of alternative structures gives rise to complex correlations in the primary sequence of the RNA. In this paper, we propose an efficient method for modeling alternative secondary structures in regulatory RNAs. The proposed method can be applied to the prediction of novel regulatory RNAs in genome sequences.

Proceedings Article•DOI•
10 Nov 2006
TL;DR: The experimental results show the approach is superior to traditional approaches, including Bisecting K-means, a leading document clustering approach, in terms of cluster quality and clustering reliability, and that it provides a concise but rich text summary in key concepts and sentences.
Abstract: We introduce a method that integrates biomedical literature clustering and summarization using biomedical ontology. The core of the approach is to identify document cluster models as semantic chunks capturing the core semantic relationships in the ontology-enriched scale-free graphical representation of documents. These document cluster models are used both for document clustering on document assignment and for text summarization on the construction of a Text Semantic Interaction Network (TSIN). Our experimental results show our approach is superior to traditional approaches, including Bisecting K-means, a leading document clustering approach, in terms of cluster quality and clustering reliability. In addition, our approach provides a concise but rich text summary in key concepts and sentences.

Proceedings Article•DOI•
28 May 2006
TL;DR: All the biclusters discovered with the proposed methodology have no imperfections, and the complexity of the algorithm is shown to be lower than that of previous approaches.
Abstract: In this paper, we describe an approach for finding all order-preserving gene biclusters from a set of DNA microarray experimental data that combines the algorithm for finding biclusters with constant values on columns, developed in one of our previous studies, with an adaptive gene expression level quantization procedure. All the biclusters discovered with the proposed methodology have no imperfections, and the complexity of the algorithm is shown to be lower than that of previous approaches. Application of the method to ovarian cancer data appears to reveal significant local patterns.
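The order-preserving property itself is easy to state in code: a bicluster qualifies when every selected row induces the same ordering of the selected columns. A sketch of that check (the search and quantization procedure of the paper is omitted):

    import numpy as np

    def is_order_preserving(M, rows, cols):
        """True iff all chosen rows rank the chosen columns identically."""
        sub = M[np.ix_(rows, cols)]
        orders = np.argsort(sub, axis=1)
        return bool((orders == orders[0]).all())

    M = np.array([[1.0, 3.0, 2.0, 9.0],
                  [0.2, 0.9, 0.5, 0.1],
                  [5.0, 8.0, 6.0, 2.0]])
    print(is_order_preserving(M, [0, 1, 2], [0, 2, 1]))  # True: same ordering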

Proceedings Article•DOI•
28 May 2006
TL;DR: An approach to subsequence identification based on 'purity functions' derived from state transition tables, to be used in conjunction with a method for the identification of predictor genes and functions is proposed.
Abstract: This paper presents a new method of fitting probabilistic Boolean networks (PBNs) to time-course state data. The critical issue to be addressed is to identify the contributions of the PBN's constituent Boolean networks in a sequence of temporal data. The sequence must be partitioned into sections, each corresponding to a single model with fixed parameters. We propose an approach to subsequence identification based on 'purity functions' derived from state transition tables, to be used in conjunction with a method for the identification of predictor genes and functions. We also present the estimation of the network switching probability, selection probabilities, perturbation rate, as well as observations on the inference of input genes, predictor functions and their relation with the length of the observed data sequence.

Proceedings Article•
01 Jan 2006
TL;DR: A DNA implementation of an arbitrary finite state machine is developed that determines whether the cell has a specific disease based on the presence or absence of indicator mRNA molecules and releases a proper drug for treatment.
Abstract: We propose a technique to diagnose and treat individual cells in the human body. A virus-like system delivers a copy of a diagnosis and treatment DNA complex to each cell. The complex determines whether the cell has a specific disease based on the presence or absence of indicator mRNA molecules and, if the diagnosis is positive, releases a proper drug for treatment. As a tool for the diagnosis and treatment system, we develop a DNA implementation of an arbitrary finite state machine.
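A toy software analogue of the diagnostic automaton: states advance only when the expected indicator mRNAs are observed, and reaching the accepting state triggers drug release. All marker names below are invented for illustration:

    def diagnose(present_mrnas, disease_signature):
        """Accept iff every indicator in the signature appears in order."""
        state = 0
        for mrna in present_mrnas:
            if state < len(disease_signature) and mrna == disease_signature[state]:
                state += 1                       # transition to next state
        return state == len(disease_signature)   # accepting state reached?

    signature = ["mRNA-A", "mRNA-B", "mRNA-C"]   # hypothetical disease markers
    cell = ["mRNA-X", "mRNA-A", "mRNA-B", "mRNA-C"]
    if diagnose(cell, signature):
        print("diagnosis positive: release drug")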

Proceedings Article•DOI•
28 May 2006
TL;DR: The result is an encoder which has excellent compression efficiency on annotated genome sequences and provides instantaneous access to functional elements in the file, thus serving as a basis for further applications, such as indexing and searching for specified feature entries.
Abstract: This article investigates the efficiency of randomly accessible coding for annotated genome files and compares it to universal coding. The result is an encoder which has excellent compression efficiency on annotated genome sequences and provides instantaneous access to functional elements in the file; it thus serves as a basis for further applications, such as indexing and searching for specified feature entries.