
Showing papers by "Satoru Miyano" published in 2002


Journal ArticleDOI
TL;DR: An improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix is created, and a Python and a Perl interface to the C Clustering Library is generated, thereby combining the flexibility of a scripting language with the speed of C.
Abstract: SUMMARY We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. AVAILABILITY The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

1,493 citations
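As a rough illustration of the k-means step named in the abstract above, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the C Clustering Library's code or API: the function name `kmeans`, its parameters, and the toy `profiles` data are all assumptions made for this example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k data points chosen at random
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center (squared Euclidean).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

# Hypothetical two-dimensional "expression profiles" forming two obvious groups.
profiles = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(profiles, k=2)
```

On well-separated toy data like this, the first three and last three profiles should end up in different clusters; real expression data usually needs several random restarts and a distance measure chosen for the problem.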


Journal ArticleDOI
TL;DR: This work has succeeded in finding rules whose prediction accuracies come close to that of TargetP, while still retaining a very simple and interpretable form.
Abstract: Motivation: The prediction of localization sites of various proteins is an important and challenging problem in the field of molecular biology. TargetP, by Emanuelsson et al. (J. Mol. Biol., 300, 1005‐1016, 2000) is a neural network based system which is currently the best predictor in the literature for N-terminal sorting signals. One drawback of neural networks, however, is that it is generally difficult to understand and interpret how and why they make such predictions. In this paper, we aim to generate simple and interpretable rules as predictors, and still achieve a practical prediction accuracy. We adopt an approach which consists of an extensive search for simple rules and various attributes which is partially guided by human intuition. Results: We have succeeded in finding rules whose prediction accuracies come close to that of TargetP, while still retaining a very simple and interpretable form. We also discuss and interpret the discovered rules. Availability: An (experimental) web service using rules obtained by our method is provided at http:

721 citations


Proceedings ArticleDOI
01 Dec 2002
TL;DR: This work proposes to infer the degree of sparseness of the gene regulatory network from the data, where Akaike's Information Criterion is used to determine which coefficients are nonzero in a linear system of differential equations.
Abstract: We describe a new method to infer a gene regulatory network, in terms of a linear system of differential equations, from time course gene expression data. As biologically the gene regulatory network is known to be sparse, we expect most coefficients in such a linear system of differential equations to be zero. In previously proposed methods, the number of nonzero coefficients in the system was limited based on ad hoc assumptions. Instead, we propose to infer the degree of sparseness of the gene regulatory network from the data, where we use Akaike's Information Criterion to determine which coefficients are nonzero. We apply our method to MMGE time course data of Bacillus subtilis.

210 citations
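The idea of letting Akaike's Information Criterion decide which coefficients are nonzero can be sketched for a single equation of such a linear system. The sketch below is not the authors' procedure: it assumes Gaussian noise, uses the common form AIC = n ln(RSS/n) + 2(k+1), enumerates every subset of candidate regulators (which only scales to a handful of genes), and runs on made-up data. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss_for_subset(columns, y, subset):
    """Least-squares fit of y on the chosen columns; returns the residual sum of squares."""
    if not subset:
        return sum(v * v for v in y)
    xs = [columns[j] for j in subset]
    gram = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
    rhs = [sum(p * q for p, q in zip(u, y)) for u in xs]
    coef = solve(gram, rhs)
    fitted = [sum(c * u[i] for c, u in zip(coef, xs)) for i in range(len(y))]
    return sum((a - b) ** 2 for a, b in zip(y, fitted))

def best_subset_by_aic(columns, y):
    """Return (aic, subset) for the set of nonzero coefficients with the lowest AIC."""
    n = len(y)
    best = None
    for size in range(len(columns) + 1):
        for subset in combinations(range(len(columns)), size):
            rss = rss_for_subset(columns, y, subset)
            aic = n * math.log(rss / n) + 2 * (size + 1)  # +1 counts the noise variance
            if best is None or aic < best[0]:
                best = (aic, subset)
    return best

# Toy data: three candidate regulators, but only the first one drives the target.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, 1, 1, 1, -1, -1, -1, -1]
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
y = [2 * a + e for a, e in zip(x1, noise)]
aic, chosen = best_subset_by_aic([x1, x2, x3], y)
```

Here the penalty term is what keeps the spurious regulators out: adding `x2` or `x3` does not reduce the residual enough to pay for the extra parameter, so only index 0 is selected.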


Book ChapterDOI
24 Nov 2002
TL;DR: This work proposes to infer the degree of sparseness of the gene regulatory network from the data, where it determines which coefficients are nonzero by using Akaike's Information Criterion.
Abstract: Spurred by advances in cDNA microarray technology, gene expression data are increasingly becoming available. In time-ordered data, the expression levels are measured at several points in time following some experimental manipulation. A gene regulatory network can be inferred by fitting a linear system of differential equations to the gene expression data. As biologically the gene regulatory network is known to be sparse, we expect most coefficients in such a linear system of differential equations to be zero. In previously proposed methods to infer such a linear system, ad hoc assumptions were made to limit the number of nonzero coefficients in the system. Instead, we propose to infer the degree of sparseness of the gene regulatory network from the data, where we determine which coefficients are nonzero by using Akaike's Information Criterion.

76 citations


Journal ArticleDOI
TL;DR: The results suggest that these eight SNPs in selectin genes may be useful for screening populations susceptible to the IgAN phenotype that involves interstitial infiltration.
Abstract: Although intensive efforts have been undertaken to elucidate the genetic background of immunoglobulin A nephropathy (IgAN), genetic factors associated with the pathogenesis of this disease are still not well understood. We designed a case-control association study that was based on linkage disequilibrium among single-nucleotide polymorphisms (SNPs) in the selectin gene cluster on chromosome 1q24-25, and we found two SNPs in the E-selectin gene (SELE8 and SELE13) and six SNPs in the L-selectin gene (SELL1, SELL4, SELL5, SELL6, SELL10, and SELL11) that were significantly associated with IgAN in Japanese patients. All eight SNPs were in almost complete linkage disequilibrium. SELE8 and SELL10 caused amino acid substitutions from His to Tyr and from Pro to Ser (χ2=9.02, P=.0026, odds ratio = 2.73 [95% confidence interval {CI} 1.38–5.38] for His-to-Tyr substitutions; χ2=17.4, P=.000031, odds ratio = 3.61 [95% CI 1.91–6.83] for Pro-to-Ser substitutions), and SELL1 could affect promoter activity of the L-selectin gene (χ2=19.5, P=.000010, odds ratio = 3.77 [95% CI 2.02–7.05]). The TGT haplotype at these three loci was associated significantly with IgAN (χ2=18.67, P=.000016, odds ratio = 1.88 [95% CI 1.41–2.51]). Our results suggest that these eight SNPs in selectin genes may be useful for screening populations susceptible to the IgAN phenotype that involves interstitial infiltration.

72 citations


Journal ArticleDOI
TL;DR: This work uses the maximum likelihood method together with Akaike's Information Criterion to fit linear splines to a small set of time-ordered gene expression data in order to infer statistically meaningful information from the measurements.
Abstract: Motivation: Recently, the temporal response of genes to changes in their environment has been investigated using cDNA microarray technology by measuring the gene expression levels at a small number of time points. Conventional techniques for time series analysis are not suitable for such a short series of time-ordered data. The analysis of gene expression data has therefore usually been limited to a fold-change analysis, instead of a systematic statistical approach. Methods: We use the maximum likelihood method together with Akaike’s Information Criterion to fit linear splines to a small set of time-ordered gene expression data in order to infer statistically meaningful information from the measurements. The significance of measured gene expression data is assessed using Student’s t-test. Results: Previous gene expression measurements of the cyanobacterium Synechocystis sp. PCC6803 were reanalyzed using linear splines. The temporal response of many genes that had been missed by a fold-change analysis was identified. Based on our statistical analysis, we found that about four gene expression measurements or more are needed at each time point. Availability: An extension module for Python to calculate linear spline functions is available at http://bonsai.ims.u-tokyo.ac.jp/~mdehoon. This software package (with patent pending) is free of charge for academic use only.

69 citations


Proceedings ArticleDOI
01 Dec 2002
TL;DR: Simulation results suggest that parameter values representing the strength of cell-autonomous suppression of Notch signaling by Delta are essential for generating two different modes of patterning: lateral inhibition and boundary formation, which could explain how a common gene regulatory network results in two different patterning modes in vivo.
Abstract: The Delta-Notch signaling system plays an essential role in various morphogenetic systems of multicellular animal development. Here we analyzed the mechanism of Notch-dependent boundary formation in the Drosophila large intestine, by experimental manipulation of Delta expression and computational modeling and simulation by Genomic Object Net. Boundary formation representing the situation in normal large intestine was shown by the simulation. By manipulating Delta expression in the large intestine, a few types of disorder in boundary cell differentiation were observed, and similar abnormal patterns were generated by the simulation. Simulation results suggest that parameter values representing the strength of cell-autonomous suppression of Notch signaling by Delta are essential for generating two different modes of patterning: lateral inhibition and boundary formation, which could explain how a common gene regulatory network results in two different patterning modes in vivo. Genomic Object Net proved to be a useful and flexible biosimulation system that is suitable for analyzing complex biological phenomena such as patternings of multicellular systems as well as intracellular changes in cell states including metabolic activities, gene regulation, and enzyme reactions.

46 citations


Patent
26 Sep 2002
TL;DR: In this article, the authors proposed a method for the analysis of complex biological information, including gene networks, using Boolean inferential methods and Bayesian methods, for determining cause and effect relationships between expressed genes, and for determining upstream effectors of regulated genes.
Abstract: Embodiments of this invention include application of new inferential methods to analysis of complex biological information, including gene networks. In some embodiments, disruptant data and/or drug induction/inhibition data are obtained simultaneously for a number of genes in an organism. New methods include modifications of Boolean inferential methods and application of those methods to determining relationships between expressed genes in organisms. Additional new methods include modifications of Bayesian inferential methods and application of those methods to determining cause and effect relationships between expressed genes, and in some embodiments, for determining upstream effectors of regulated genes. Additional modifications of Bayesian methods include use of heterogeneous variance and different curve fitting methods, including spline functions, to improve estimation of graphs of networks of expressed genes. Other embodiments include the use of bootstrapping methods and determination of edge effects to more accurately provide network information between expressed genes. Methods of this invention were validated using information obtained from prior studies, as well as from newly carried out studies of gene expression.

35 citations


Proceedings ArticleDOI
14 Aug 2002
TL;DR: A new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network is proposed and a new graph selection criterion from Bayesian approach in general situations is theoretically derived.
Abstract: We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is in the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. A problem still remains to be solved in selecting an optimal graph, which gives the best representation of the system among genes. We theoretically derive a new graph selection criterion from Bayes approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

32 citations


Journal ArticleDOI
TL;DR: The data imply that some haplotype of the HLA-DRA locus has an important role in the development of IgAN in Japanese patients, and this relationship between HLA class II genes and IgAN is investigated.
Abstract: Immunoglobulin A nephropathy (IgAN) is a form of chronic glomerulonephritis of unknown etiology and pathogenesis. Immunogenetic studies have not conclusively indicated that human leukocyte antigen (HLA) is involved. As a first step in investigating a possible relationship between HLA class II genes and IgAN, we analyzed the extent of linkage disequilibrium (LD) in this region of chromosome 6p21.3 in a Japanese test population and found extended LD blocks within the class II locus. We designed a case-control association study of single-nucleotide polymorphisms (SNPs) in each of those LD blocks, and determined that SNPs located in the HLA-DRA gene were significantly associated with an increased risk of IgAN (P = 0.000001, odds ratio = 1.91 [95% confidence interval 1.46–2.49]); SNPs in other LD blocks were not. Our data imply that some haplotype of the HLA-DRA locus has an important role in the development of IgAN in Japanese patients.

31 citations


Journal ArticleDOI
TL;DR: The method for measuring the reliability of the estimated gene network by using the bootstrap method is proposed, which shows good results in both the accuracy and the efficiency of the estimation.
Abstract: The development of the microarray technology provides us a huge amount of gene expression profiles. The estimation of a gene network has received considerable attention in the field of bioinformatics and several methodologies have been proposed such as the Boolean network [1], the Bayesian network [3, 4, 5] and so on. In this paper, we propose the method for measuring the reliability of the estimated gene network by using the bootstrap method [2].
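The bootstrap idea in the abstract above can be sketched as follows: resample the arrays with replacement, re-estimate the network each time, and report how often each edge reappears. The abstract does not describe the network estimator, so the sketch substitutes a deliberately simple stand-in (an edge wherever the absolute Pearson correlation exceeds a threshold). Every name, the threshold, and the data are assumptions for illustration.

```python
import random
from itertools import combinations

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def estimate_edges(samples, threshold=0.8):
    """Toy stand-in estimator: link two genes when |correlation| exceeds the threshold."""
    genes = list(zip(*samples))  # one tuple of values per gene
    return {
        (i, j)
        for i, j in combinations(range(len(genes)), 2)
        if abs(correlation(genes[i], genes[j])) > threshold
    }

def bootstrap_edge_reliability(samples, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each edge is re-estimated."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]  # draw arrays with replacement
        for edge in estimate_edges(resample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# Hypothetical data: each row is one array, each column one gene.
# Gene 1 is an exact linear function of gene 0; gene 2 is unrelated.
samples = [(i, 2 * i + 1, (i * 7) % 5) for i in range(12)]
reliability = bootstrap_edge_reliability(samples)
```

An edge that survives in nearly every replicate is well supported by the data, while an edge that appears only occasionally is likely an artifact of a few arrays.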

Journal ArticleDOI
TL;DR: A new approach to pattern discovery called string pattern regression is presented, where a data set is given that consists of a string attribute and an objective numerical attribute, and an exact but efficient branch-and-bound algorithm is presented which is applicable to various pattern classes.
Abstract: We present a new approach to pattern discovery called string pattern regression, where we are given a data set that consists of a string attribute and an objective numerical attribute. The problem is to find the best string pattern that divides the data set in such a way that the distribution of the numerical attribute values of the set for which the pattern matches the string attribute, is most distinct, with respect to some appropriate measure, from the distribution of the numerical attribute values of the set for which the pattern does not match the string attribute. By solving this problem, we are able to discover, at the same time, a subset of the data whose objective numerical attributes are significantly different from rest of the data, as well as the splitting rule in the form of a string pattern that is conserved in the subset. Although the problem can be solved in linear time for the substring pattern class, the problem is NP-hard in the general case (i.e. more complex patterns), and we present an exact but efficient branch-and-bound algorithm which is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications, such as microarray gene expression data.
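For the substring pattern class, the search described above can be sketched by brute force. This is not the linear-time method or the branch-and-bound algorithm mentioned in the abstract: it simply tries every substring up to an assumed maximum length. It also assumes one particular distinctness measure, the reduction in within-group sum of squared errors, where the abstract leaves the measure open. Names and data are hypothetical.

```python
def best_substring_split(strings, values, max_len=4):
    """Brute-force search for the substring whose presence best separates `values`."""
    def sse(xs):
        # Sum of squared deviations from the group mean (0.0 for an empty group).
        if not xs:
            return 0.0
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    # Every substring of length 1..max_len that occurs in at least one input string.
    candidates = {
        s[i:i + n]
        for s in strings
        for n in range(1, max_len + 1)
        for i in range(len(s) - n + 1)
    }
    total = sse(values)
    best = (0.0, None)
    for pat in sorted(candidates):
        hit = [v for s, v in zip(strings, values) if pat in s]
        miss = [v for s, v in zip(strings, values) if pat not in s]
        if not hit or not miss:
            continue  # the pattern must actually split the data set
        gain = total - sse(hit) - sse(miss)
        if gain > best[0]:
            best = (gain, pat)
    return best

# Hypothetical sequences with an associated numerical attribute.
seqs = ["ACGTAG", "TTGTAC", "CCGTAA", "ACCCTG", "TTCACC", "GGCATT"]
vals = [9.0, 8.5, 9.5, 1.0, 1.5, 0.5]
gain, pattern = best_substring_split(seqs, vals)
```

Several substrings can induce the same split and therefore the same score, so the returned pattern is one representative; what matters is the subset of the data it selects.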

Journal ArticleDOI
TL;DR: A dynamic Bayesian network and nonparametric regression model for estimating a gene network with cyclic regulations from time series microarray data is proposed and a criterion for selecting a network from Bayes approach is derived.
Abstract: A Bayesian network is a powerful tool for modeling relations among a large number of random variables. Therefore the Bayesian network has received considerable attention from the studies of gene network estimation using microarray gene expression data. Imoto et al. [1, 2] proposed a Bayesian network and nonparametric regression model for capturing nonlinear relations between genes from the continuous gene expression data. However, a Bayesian network still has a problem that it cannot construct cyclic regulations, while real gene networks have cyclic regulations. For a solution of this problem, in this paper, we propose a dynamic Bayesian network and nonparametric regression model for estimating a gene network with cyclic regulations from time series microarray data. We also derive a criterion for selecting a network from Bayes approach. The effectiveness of our method is displayed through the analysis of the Saccharomyces cerevisiae gene expression data.

Proceedings ArticleDOI
Sascha Ott, Yoshinori Tamada, Hideo Bannai, Kenta Nakai, Satoru Miyano
01 Dec 2002
TL;DR: A new computational method for the analysis of DNA sequences with respect to splicing is developed and several results are derived indicating that intrasplicing may be an appropriate model for the splicing of at least part of the long intron sequences.
Abstract: We propose a new model for the splicing of long introns, which we call intrasplicing. The basic idea of this model is that the splicing of long introns may be facilitated by the splicing of inner parts of the intron prior to the splicing of the long intron itself. Since long introns have up to about 100,000 bases, this model seems to be a likely explanation of their splicing. To investigate the possibility of this model, we develop a new computational method for the analysis of DNA sequences with respect to splicing. We analyze the genomic sequence of four species with our method and derive several results indicating that intrasplicing may be an appropriate model for the splicing of at least part of the long intron sequences.

Journal Article
TL;DR: In this article, the authors propose to infer the degree of sparseness of the gene regulatory network from the data, determining which coefficients are nonzero by using Akaike's Information Criterion.
Abstract: Spurred by advances in cDNA microarray technology, gene expression data are increasingly becoming available. In time-ordered data, the expression levels are measured at several points in time following some experimental manipulation. A gene regulatory network can be inferred by fitting a linear system of differential equations to the gene expression data. As biologically the gene regulatory network is known to be sparse, we expect most coefficients in such a linear system of differential equations to be zero. In previously proposed methods to infer such a linear system, ad hoc assumptions were made to limit the number of nonzero coefficients in the system. Instead, we propose to infer the degree of sparseness of the gene regulatory network from the data, where we determine which coefficients are nonzero by using Akaike's Information Criterion.

Journal ArticleDOI
TL;DR: This work has developed a computer software, named G.NET, for visualizing and analyzing the large-scale gene network, and developed the gene network layout algorithms, named GNL algorithm, in order to display the big gene network in 2 and 3 dimensional spaces effectively.
Abstract: In recent years, for solving the whole aspect of gene regulation mechanism, the analysis of a gene network attracts considerable attention in the field of molecular biology and bioinformatics. Various methodologies [1, 3, 4] have been developed for inferring a gene network from cDNA microarray gene expression data. However, after constructing a gene network, there are still some problems to be solved in how to extract valuable information from such large-scale network. For example, finding the complex interactions among genes, the evaluation of the estimated gene pathways and so on. For a solution of these problems, we have developed a computer software, named G.NET, for visualizing and analyzing the large-scale gene network. We have also developed the gene network layout algorithms, named GNL algorithm, in order to display the large-scale gene network in 2 and 3 dimensional spaces effectively.

Journal ArticleDOI
TL;DR: This paper compares the XML description of GON with the XML description of SBML, examines whether it could be converted from GON Assembler to SBML or vice versa, and investigates the automatic conversion between GON and SBML.
Abstract: Recently, the importance of biosimulation software in systems biology has been emphasized and has received considerable attention. Since the information conventionally created with most biosimulation software does not have a common format for modeling, it has been very difficult to exchange pathway models among them. At present, the Systems Biology group of ERATO Kitano Symbiosis System Project takes the lead, and has proposed the standard language System Biology Mark-up Language (SBML) [1], which can give a common infrastructure for several biosimulation tools such as Bio/SPISE, E-Cell, DBSolve, Gepasi, Stochsim, and Virtual Cell. On the other hand, Genomic Object Net (GON) [2, 3] is a biosimulation system which uses hybrid functional Petri net (HFPN) and extensible markup language (XML) as basic mechanisms. With GON, even researchers who are not familiar with mathematical modeling techniques such as differential equations and programming can easily model and simulate a biological phenomenon. This paper compares the XML description of GON with the XML description of SBML, examines whether it could be converted from GON Assembler to SBML or vice versa, and investigates the automatic conversion between GON and SBML.

Proceedings Article
01 Dec 2002
TL;DR: In this paper, the problem of extracting multiple unordered short motifs in upstream regions of given genes is considered, and a fast method is developed to exhaustively search collections of short motifs over given short motifs for a particular set of genes and to rank the collections using multiple objective functions.
Abstract: In this paper, we consider the problem of extracting multiple unordered short motifs in upstream regions of given genes. Multiple unordered short motifs can be considered as a set of short motifs, say M = {m1, m2, ..., mk}. For a gene g, if each of the motifs m1, ..., mk occurs in either the upstream region of g or its complement, the gene g is said to be consistent with M. We have developed a fast method to exhaustively search collections of short motifs over given short motifs for a particular set of genes, and to rank the collections using multiple objective functions. This method is implemented by employing bit operations in the process of matching short motifs with upstream regions, and identifying the genes which are consistent with short motifs. On various putatively co-regulated genes of Saccharomyces cerevisiae, determined by gene expression profiles, our computational experiments show biologically interesting results.
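One way the bit-operation idea above could work is to store, for each motif, an integer whose bit i records whether gene i contains the motif. The genes consistent with a collection M are then the bitwise AND of the masks of its motifs. The sketch below is an assumed reading of the abstract, not the authors' implementation: "complement" is interpreted as the reverse complement strand, and all names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def motif_mask(motif, upstream_regions):
    """Bit i is set when gene i has the motif on either strand of its upstream region."""
    mask = 0
    for i, region in enumerate(upstream_regions):
        if motif in region or motif in reverse_complement(region):
            mask |= 1 << i
    return mask

def consistent_genes(motifs, masks, n_genes):
    """AND the per-motif masks: a gene is consistent with M if every motif occurs."""
    result = (1 << n_genes) - 1  # start with every gene marked as consistent
    for m in motifs:
        result &= masks[m]
    return result

# Hypothetical upstream regions for four genes.
regions = ["ACGTTTGACA", "TTGACAGGGG", "CCCCACGTAA", "TGTCAAACGT"]
masks = {m: motif_mask(m, regions) for m in ("ACGT", "TGACA")}
both = consistent_genes(["ACGT", "TGACA"], masks, len(regions))
gene_indices = [i for i in range(len(regions)) if both >> i & 1]
```

Because each mask is computed once, evaluating a collection of k motifs costs k AND operations regardless of sequence length, which is what makes an exhaustive search over collections affordable.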

Journal ArticleDOI
TL;DR: A fast method to exhaustively search collections of short motifs over given long motifs for a particular set of genes, and rank collections with using multiple objective functions is developed.


Book ChapterDOI
03 Jul 2002
TL;DR: In this paper, the authors studied the problem of finding a position-specific score matrix (PSSM) which correctly discriminates between positive and negative examples, and proved that this problem is solved in polynomial time if the size of a PSSM is bounded by a constant.
Abstract: PSSMs (Position-Specific Score Matrices) have been applied to various problems in Bioinformatics. We study the following problem: given positive examples (sequences) and negative examples (sequences), find a PSSM which correctly discriminates between positive and negative examples. We prove that this problem is solved in polynomial time if the size of a PSSM is bounded by a constant. On the other hand, we prove that this problem is NP-hard if the size is not bounded. We also prove similar results on deriving a mixture of PSSMs.
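To make the discrimination problem concrete, the sketch below checks whether a given PSSM separates positive from negative sequences. The abstract does not spell out the acceptance rule, so this example assumes a common one: a sequence is accepted when some window of the PSSM's width scores at least a threshold. It only verifies a candidate PSSM and does not search for one. The names and the toy matrix are hypothetical.

```python
def window_score(pssm, window):
    """Sum the per-position scores of the letters in one window."""
    return sum(column[ch] for column, ch in zip(pssm, window))

def accepts(pssm, threshold, seq):
    """True when at least one window of the PSSM's width reaches the threshold."""
    width = len(pssm)
    return any(
        window_score(pssm, seq[i:i + width]) >= threshold
        for i in range(len(seq) - width + 1)
    )

def discriminates(pssm, threshold, positives, negatives):
    """True when every positive is accepted and every negative is rejected."""
    return all(accepts(pssm, threshold, s) for s in positives) and not any(
        accepts(pssm, threshold, s) for s in negatives
    )

# Toy width-3 PSSM that rewards the word TAT (one dict of letter scores per position).
pssm = [
    {"A": 0, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 1},
]
positives = ["GGTATC", "TATAAA"]
negatives = ["GGGCCC", "TTTAAA"]
strict = discriminates(pssm, 3, positives, negatives)
loose = discriminates(pssm, 2, positives, negatives)
```

With threshold 3 only an exact TAT window is accepted, so the toy PSSM separates the two sets; lowering the threshold to 2 lets the negative TTTAAA through via its TTT window, and discrimination fails.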

Book ChapterDOI
24 Nov 2002
TL;DR: This paper defines a measure of approximation of a hypothesis class C1 to another class C2 and discusses lower bounds of the approximation ratios among representative classes of hypotheses like decision lists, decision trees, linear discriminant functions and so on.
Abstract: Computational knowledge discovery can be considered to be a complicated human activity concerned with searching for something new from data with computer systems. The optimization of the entire process of computational knowledge discovery is a big challenge in computer science. If we had an atlas of hypothesis classes which describes prior and basic knowledge on relative relationship between the hypothesis classes, it would be helpful in selecting hypothesis classes to be searched in discovery processes. In this paper, to give a foundation for an atlas of various classes of hypotheses, we have defined a measure of approximation of a hypothesis class C1 to another class C2. The hypotheses we consider here are restricted to m-ary Boolean functions. For 0 < ε < 1, we say that C1 is (1-ε)-approximated to C2 if, for every distribution D over {0, 1}^m, and for each hypothesis h1 ∈ C1, there exists a hypothesis h2 ∈ C2 such that, with the probability at most ε, we have h1(x) ≠ h2(x) where x ∈ {0, 1}^m is drawn randomly and independently according to D. Thus, we can use the approximation ratio of C1 to C2 as an index of how similar C1 is to C2. We discuss lower bounds of the approximation ratios among representative classes of hypotheses like decision lists, decision trees, linear discriminant functions and so on. This prior knowledge would come in useful when selecting hypothesis classes in the initial stage and the sequential stages involved in the entire discovery process.
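The quantity at the heart of this definition, the probability that two hypotheses disagree on an input drawn from D, can be computed exactly for small m by enumeration. The sketch below does this for one fixed distribution and one fixed hypothesis; the definition in the abstract quantifies over every distribution and every hypothesis in the first class, which this example does not attempt. The hypothesis classes and names used here are made up for illustration.

```python
from itertools import product

def disagreement(h1, h2, dist):
    """Probability under `dist` that h1(x) != h2(x), with dist mapping points to weights."""
    return sum(p for x, p in dist.items() if h1(x) != h2(x))

def best_approximation(h1, candidates, dist):
    """Smallest disagreement achieved by any named candidate, as (error, name)."""
    return min((disagreement(h1, h2, dist), name) for name, h2 in candidates.items())

m = 3
points = list(product((0, 1), repeat=m))
uniform = {x: 1 / len(points) for x in points}  # one particular choice of D

def majority(x):
    """Hypothesis to be approximated: 1 when at least two of the three bits are 1."""
    return int(sum(x) >= 2)

# A tiny hypothetical target class: single-variable projections and the two constants.
candidates = {
    "x0": lambda x: x[0],
    "x1": lambda x: x[1],
    "x2": lambda x: x[2],
    "const0": lambda x: 0,
    "const1": lambda x: 1,
}
eps, name = best_approximation(majority, candidates, uniform)
```

Under the uniform distribution, each projection disagrees with the majority function on two of the eight inputs, so the best achievable error in this toy class is 0.25, while either constant is wrong half the time.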

Journal ArticleDOI
TL;DR: GON is proved to be a useful and flexible biosimulation system that is suitable for analyzing complex biological phenomena such as patternings of multicellular systems as well as intracellular changes in cell states including metabolic activities, gene regulation, and enzyme reactions.
Abstract: The Delta-Notch signaling system plays an essential role in various morphogenetic systems of multicellular animal development [1]. Here we analyzed the mechanism of Notch-dependent boundary formation in the Drosophila large intestine by experimental manipulation of Delta expression and computational modeling and simulation by Genomic Object Net (GON) [3]. GON is proved to be a useful and flexible biosimulation system that is suitable for analyzing complex biological phenomena such as patternings of multicellular systems as well as intracellular changes in cell states including metabolic activities, gene regulation, and enzyme reactions.

Journal Article
TL;DR: In this article, a measure of approximation of a hypothesis class C1 to another class C2 is defined, where C1 is (1-ε)-approximated to C2 if, for every distribution D over {0, 1}^m, and for each hypothesis h1 ∈ C1, there exists a hypothesis h2 ∈ C2 such that, with the probability at most ε, we have h1(x) ≠ h2(x), where x ∈ {0, 1}^m is drawn randomly and independently according to D.
Abstract: Computational knowledge discovery can be considered to be a complicated human activity concerned with searching for something new from data with computer systems. The optimization of the entire process of computational knowledge discovery is a big challenge in computer science. If we had an atlas of hypothesis classes which describes prior and basic knowledge on relative relationship between the hypothesis classes, it would be helpful in selecting hypothesis classes to be searched in discovery processes. In this paper, to give a foundation for an atlas of various classes of hypotheses, we have defined a measure of approximation of a hypothesis class C1 to another class C2. The hypotheses we consider here are restricted to m-ary Boolean functions. For 0 < ε < 1, we say that C1 is (1-ε)-approximated to C2 if, for every distribution D over {0, 1}^m, and for each hypothesis h1 ∈ C1, there exists a hypothesis h2 ∈ C2 such that, with the probability at most ε, we have h1(x) ≠ h2(x) where x ∈ {0, 1}^m is drawn randomly and independently according to D. Thus, we can use the approximation ratio of C1 to C2 as an index of how similar C1 is to C2. We discuss lower bounds of the approximation ratios among representative classes of hypotheses like decision lists, decision trees, linear discriminant functions and so on. This prior knowledge would come in useful when selecting hypothesis classes in the initial stage and the sequential stages involved in the entire discovery process.

Journal ArticleDOI
TL;DR: In GON, a biopathway is modeled as an extension of the hybrid Petri net called the hybrid functional Petri net (HFPN), for which XML documentation is defined, and the user has to write XML documents for personalized visualization.
Abstract: Genomic Object Net (GON) is a biosimulation tool that allows us to model various kinds of biopathways including gene regulatory networks, metabolic pathways, and signal transduction pathways in a biologically intuitive way [4]. In GON, a biopathway is modeled as an extension of the hybrid Petri net called the hybrid functional Petri net (HFPN), for which XML documentation is defined. As a graphical editor of a biopathway, GON is equipped with a tool, GON Assembler, for drawing and simulating the biopathway. Of course, the user need not touch any XML definitions. Furthermore, neither ordinary differential equations with messy coefficients nor programming labor is explicitly required for the user to model biopathways in GON. Since it is designed to be so powerful and flexible, E-CELL can also be realized as a subsystem of GON [1]. GON also has a visualization tool for simulation called GON Visualizer. By writing an XML document for GON Visualizer, the user can animate and see visually the interactions in a biopathway so that the user can create and test hypotheses for biological phenomena [2, 3]. However, the user has to write XML documents for personalized visualization, although this is not a disastrous obstacle for biologists in the way that ODE design and C++ programming are.

Journal ArticleDOI
TL;DR: A new version of Genomic Object Net in JAVA (JAVA GON for short) is developed, which inherits basic ideas and concepts in Genomic Object Net while enhancing the ability for handling not only biopathways but also localization information and multicellular processes.
Abstract: In the post-genome era, biopathway information processing will be one of the most important issues in Bioinformatics. Development of Genomic Object Net [6] is our approach to this issue. This software aims at describing and simulating structurally complex dynamic causal interactions and processes such as metabolic pathways, signal transduction cascades, gene regulations. We have released Genomic Object Net (ver. 0.919) in 2001. With this system, we have shown that we can reorganize and represent various biopathway information so that biopathways can be modeled and simulated for new hypothesis generation and testing (see [3, 4, 5, 6]). Although we have succeeded in modeling and simulating various biopathways without so much efforts, we have further identified more inconveniences through our biopathway modeling activities. This motivated us to develop a new version Genomic Object Net in JAVA (JAVA GON for short) from scratch. It inherits basic ideas and concepts in Genomic Object Net (ver. 0.919) while enhancing the ability for handling not only biopathways but also localization information and multicellular processes.

Journal ArticleDOI
TL;DR: This work analyzed a dataset of several hundred cases in which aberrant splicing is caused by mutations and made use of the human genomic sequence and studied the sequences in the regions of these mutations.
Abstract: Splicing is a process that removes introns from the pre-mRNA transcript of genes and thereby connects the exons to form the mature mRNA. It takes place in the cell nucleus and is known to play a major role in the expression of genetic information in eukaryotes. Alternative splicing, splicing enhancers, and splicing inhibitors are fields of active research [1, 2, 3]. In this work, we focus on aberrant splicing. Aberrant splicing refers to abnormal variations in the splicing process that can cause diseases. We analyze a dataset of several hundred cases in which aberrant splicing is caused by mutations. The mutations are substitutions, insertions, deletions, and duplications. In about 95% of these cases, the splicing was affected and changed in one of the following ways: at least one exon was skipped, an intron was retained, or the length of one exon was changed. In order to understand the differential effect of mutations on splicing, we made use of the human genomic sequence and studied the sequences in the regions of these mutations.

Journal Article
TL;DR: This work proves that the problem of finding a PSSM which correctly discriminates between positive and negative examples can be solved in polynomial time if the size of the PSSM is bounded by a constant, and that the problem is NP-hard if the size is not bounded.
Abstract: PSSMs (Position-Specific Score Matrices) have been applied to various problems in Bioinformatics. We study the following problem: given positive examples (sequences) and negative examples (sequences), find a PSSM which correctly discriminates between positive and negative examples. We prove that this problem is solved in polynomial time if the size of a PSSM is bounded by a constant. On the other hand, we prove that this problem is NP-hard if the size is not bounded. We also prove similar results on deriving a mixture of PSSMs.
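To make the problem statement concrete, the following is a minimal Python sketch of what it means for a PSSM to discriminate between positive and negative examples. It only checks a given candidate; it is not the search algorithm from the paper. The abstract does not fix a scoring convention, so the choices here are assumptions: a sequence is accepted when its best-scoring window reaches a threshold, symbols missing from a column score 0, and the names `window_score`, `best_score`, and `discriminates` are hypothetical.

```python
from typing import Dict, List, Sequence

# A PSSM is modeled here as one score table per position (an assumption).
PSSM = List[Dict[str, float]]


def window_score(pssm: PSSM, window: str) -> float:
    """Sum the per-position scores of one window; unknown symbols score 0."""
    return sum(column.get(symbol, 0.0) for column, symbol in zip(pssm, window))


def best_score(pssm: PSSM, sequence: str) -> float:
    """Highest score over all windows of the PSSM's length in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        return float("-inf")  # sequence too short to contain any window
    return max(
        window_score(pssm, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
    )


def discriminates(
    pssm: PSSM,
    threshold: float,
    positives: Sequence[str],
    negatives: Sequence[str],
) -> bool:
    """True if every positive reaches the threshold and no negative does."""
    return all(best_score(pssm, s) >= threshold for s in positives) and all(
        best_score(pssm, s) < threshold for s in negatives
    )


# Toy example with a width-2 matrix over a DNA alphabet (made-up scores).
example_pssm: PSSM = [{"A": 1.0, "C": -1.0}, {"T": 1.0, "G": -1.0}]
example_positives = ["GATC", "AT"]
example_negatives = ["CCGG", "AG"]
separated = discriminates(example_pssm, 1.5, example_positives, example_negatives)
```

Checking a candidate matrix in this way takes time linear in the total input length; the hardness result in the abstract concerns finding such a matrix when its size is not bounded.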

Book ChapterDOI
01 Jan 2002
TL;DR: The aim of VOX (View Oriented eXploration) is to construct a principle of computational knowledge discovery, which will be used for building actual applications or discovery systems and for accelerating the entire discovery process.
Abstract: We propose a new paradigm for computational knowledge discovery, called VOX (View Oriented eXploration). Recent research has revealed that actual discoveries cannot be achieved using only component technologies such as machine learning theory or data mining algorithms. Recognizing how the computer can assist the actual discovery tasks, we developed a solution to this problem. Our aim is to construct a principle of computational knowledge discovery, which will be used for building actual applications or discovery systems, and for accelerating such entire processes. VOX is a mathematical abstraction of knowledge discovery processes, and provides a unified description method for the discovery processes. We present advantages obtained by using VOX. Through an actual computational experiment, we show the usefulness of this new paradigm. We also designed a programming language based on this concept. The language is called VML (View Modeling Language), which is defined as an extension of a functional language ML. Finally, we present the future plans and directions in this research.