Showing papers by "Yi-Ping Phoebe Chen" published in 2009


Journal ArticleDOI
TL;DR: This paper presents a novel algorithm that can be applied to a small data set with a high number of features and that outperforms the commonly used Principal Component Analysis (PCA)/Multi-Dimensional Scaling (MDS) methods and the more recently developed ISOMap dimensionality reduction method.
Abstract: Emotional expression and understanding are normal human instincts, but automatic emotion recognition from speech without reference to any language or linguistic information remains an open problem. The limited size of existing emotional data samples and their relatively high dimensionality are beyond the reach of many dimensionality reduction and feature selection algorithms. This paper focuses on data preprocessing techniques that aim to extract the most effective acoustic features to improve emotion recognition performance. A novel algorithm is presented that can be applied to a small data set with a high number of features. The algorithm integrates the advantages of a decision tree method and the random forest ensemble. Experimental results on a series of Chinese emotional speech data sets indicate that the presented algorithm achieves improved emotion recognition results and outperforms the commonly used Principal Component Analysis (PCA)/Multi-Dimensional Scaling (MDS) methods and the more recently developed ISOMap dimensionality reduction method.

198 citations


Journal ArticleDOI
TL;DR: Both extreme pathways and correlated reaction sets are derived from the topology information of metabolic networks, and their strong relationship suggests a possible mechanism: as a controllable unit, an extreme pathway is regulated by its corresponding correlated reaction sets, and a correlated reaction set is further regulated by the organism's regulatory network.
Abstract: Constraint-based modeling of reconstructed genome-scale metabolic networks has been successfully applied on several microorganisms. In constraint-based modeling, in order to characterize all allowable phenotypes, network-based pathways, such as extreme pathways and elementary flux modes, are defined. However, as the scale of metabolic network rises, the number of extreme pathways and elementary flux modes increases exponentially. Uniform random sampling solves this problem to some extent to study the contents of the available phenotypes. After uniform random sampling, correlated reaction sets can be identified by the dependencies between reactions derived from sample phenotypes. In this paper, we study the relationship between extreme pathways and correlated reaction sets. Correlated reaction sets are identified for E. coli core, red blood cell and Saccharomyces cerevisiae metabolic networks respectively. All extreme pathways are enumerated for the former two metabolic networks. As for Saccharomyces cerevisiae metabolic network, because of the large scale, we get a set of extreme pathways by sampling the whole extreme pathway space. In most cases, an extreme pathway covers a correlated reaction set in an 'all or none' manner, which means either all reactions in a correlated reaction set or none is used by some extreme pathway. In rare cases, besides the 'all or none' manner, a correlated reaction set may be fully covered by combination of a few extreme pathways with related function, which may bring redundancy and flexibility to improve the survivability of a cell. In a word, extreme pathways show strong complementary relationship on usage of reactions in the same correlated reaction set. Both extreme pathways and correlated reaction sets are derived from the topology information of metabolic networks. The strong relationship between correlated reaction sets and extreme pathways suggests a possible mechanism: as a controllable unit, an extreme pathway is regulated by its corresponding correlated reaction sets, and a correlated reaction set is further regulated by the organism's regulatory network.
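
The step of identifying correlated reaction sets from sampled phenotypes can be illustrated with a minimal sketch. It assumes that the dependency between two reactions is measured by the Pearson correlation of their sampled fluxes and that reactions are grouped when the absolute correlation is close to 1; the abstract does not state the measure or threshold, so both are assumptions. The flux values and the names `samples`, `pearson` and `correlated_sets` are hypothetical and stand in for the output of a real uniform random sampler.

```python
from itertools import combinations
from math import sqrt

# Hypothetical sampled fluxes: reaction name -> flux value in each sample.
samples = {
    "v1": [1.0, 2.0, 3.0, 4.0],
    "v2": [2.0, 4.0, 6.0, 8.0],      # always twice v1
    "v3": [4.0, 1.0, 3.0, 2.0],      # unrelated to v1
    "v4": [-1.0, -2.0, -3.0, -4.0],  # always the opposite of v1
}


def pearson(a, b):
    """Pearson correlation of two equally long flux lists (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def correlated_sets(samples, threshold=0.999):
    """Group reactions whose sampled fluxes are (anti-)correlated above the threshold."""
    names = list(samples)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(pearson(samples[a], samples[b])) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Only groups with at least two reactions count as correlated sets.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


print(correlated_sets(samples))
```

In this toy data the reactions `v1`, `v2` and `v4` form one set because their fluxes always change together, while `v3` stays on its own.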

22 citations


Journal ArticleDOI
TL;DR: PseudoBase is a database providing structural, functional, and sequence data related to RNA pseudoknots, and a novel framework using quantitative association rule mining to analyze the pseudoknot data is presented.
Abstract: An RNA pseudoknot consists of nonnested double-stranded stems connected by single-stranded loops. There is increasing recognition that RNA pseudoknots are one of the most prevalent RNA structures and fulfill a diverse set of biological roles within cells, and there is an expanding rate of studies into RNA pseudoknotted structures as well as increasing allocation of function. These not only produce valuable structural data but also facilitate an understanding of structural and functional characteristics in RNA molecules. PseudoBase is a database providing structural, functional, and sequence data related to RNA pseudoknots. To capture the features of RNA pseudoknots, we present a novel framework using quantitative association rule mining to analyze the pseudoknot data. The derived rules are classified into specified association groups regarding structure, function, and category of RNA pseudoknots. The discovered association rules assist biologists in filtering out significant knowledge of structure-function and structure-category relationships. A brief biological interpretation to the relationships is presented, and their potential correlations with each other are highlighted.

18 citations


Journal ArticleDOI
TL;DR: This new candidate working set (CWS) strategy can select several greatest violating samples from Cache as the iterative working sets for the next several optimizing steps, which can improve the efficiency of the kernel cache usage and reduce the computational cost related to the working set selection.
Abstract: Sequential minimal optimization (SMO) is quite an efficient algorithm for training the support vector machine. The most important step of this algorithm is the selection of the working set, which greatly affects the training speed. The feasible direction strategy for working set selection can decrease the objective function, but it may increase the total computation needed to select the working set in each iteration. In this paper, a new candidate working set (CWS) strategy is presented that considers the cost of working set selection and cache performance. This new strategy can select several greatest violating samples from Cache as the iterative working sets for the next several optimizing steps, which can improve the efficiency of the kernel cache usage and reduce the computational cost related to the working set selection. The results of the theoretical analysis and experiments demonstrate that the proposed method can reduce the training time, especially on large-scale datasets.
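
For context, the standard maximal-violating-pair rule that feasible direction selection refers to can be sketched as follows. This is not the CWS strategy of the paper, whose details are not given in the abstract; it is the textbook selection rule for the SVM dual (minimize 0.5 aᵀQa − eᵀa subject to yᵀa = 0 and 0 ≤ a ≤ C), and the function name and toy inputs are assumptions made for illustration.

```python
def maximal_violating_pair(alpha, y, grad, C, eps=1e-12):
    """Pick the pair (i, j) that most violates the KKT conditions of the SVM dual.

    alpha: current dual variables, y: labels in {+1, -1},
    grad: gradient of the dual objective at alpha, C: box constraint.
    Returns (i, j, gap), or None if one of the index sets is empty.
    """
    indices = range(len(y))
    # Samples whose alpha can still move in the "up" direction.
    up = [t for t in indices
          if (y[t] == 1 and alpha[t] < C - eps) or (y[t] == -1 and alpha[t] > eps)]
    # Samples whose alpha can still move in the "low" direction.
    low = [t for t in indices
           if (y[t] == -1 and alpha[t] < C - eps) or (y[t] == 1 and alpha[t] > eps)]
    if not up or not low:
        return None
    i = max(up, key=lambda t: -y[t] * grad[t])
    j = min(low, key=lambda t: -y[t] * grad[t])
    gap = -y[i] * grad[i] + y[j] * grad[j]  # size of the violation
    return i, j, gap


# Toy starting point: all alphas are zero, so the gradient is -1 everywhere.
alpha = [0.0, 0.0, 0.0, 0.0]
y = [1, 1, -1, -1]
grad = [-1.0, -1.0, -1.0, -1.0]
print(maximal_violating_pair(alpha, y, grad, C=1.0))
```

Running this selection over all samples in every iteration is the per-iteration cost the abstract refers to; a candidate-set approach would instead reuse several strongly violating samples across the next few steps.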

17 citations


Journal ArticleDOI
TL;DR: Analyzing more than 800 organisms' metabolic networks suggests that the reactions with larger impact degrees are likely essential and the universal reactions should also be essential, and shows that scale‐free feature and reaction reversibility contribute to the robustness in metabolic networks.
Abstract: Robustness is an inherent property of biological systems. There is still limited understanding of how it is accomplished at the cellular or molecular level. To this end, this article analyzes the impact degree of each reaction on others, which is defined as the number of cascading failures of following and/or forward reactions when an initial reaction is deleted. Analysis of more than 800 organisms' metabolic networks suggests that reactions with larger impact degrees are likely essential and that universal reactions should also be essential. Alternative metabolic pathways compensate for null mutations, which is reflected in the small average impact degrees for all organisms. Interestingly, average impact degrees of archaea are smaller than those of the other two categories of organisms, eukaryotes and bacteria, indicating that archaea have strong robustness against various perturbations during evolution. The results show that the scale-free feature and reaction reversibility contribute to robustness in metabolic networks. The optimal growth temperature of an organism also relates to the robust structure of its metabolic network.
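
The impact degree defined above can be illustrated with a small sketch under simplifying assumptions: only forward cascades are modeled, a reaction fails as soon as any of its substrates can no longer be produced, and a set of seed metabolites is always available. The toy network, the metabolite names and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical toy network: reaction -> (substrates, products).
reactions = {
    "R1": (["glc"], ["g6p"]),
    "R2": (["g6p"], ["f6p"]),
    "R3": (["f6p"], ["pyr"]),
    "R4": (["glc"], ["pyr"]),   # alternative route to pyr
    "R5": (["pyr"], ["accoa"]),
}
seeds = {"glc"}  # metabolites assumed to be supplied from outside


def surviving(reactions, seeds, removed=None):
    """Return the reactions that can still run after `removed` is deleted."""
    active = set(reactions) - {removed}
    while True:
        # Metabolites obtainable from seeds and from still-active reactions.
        available = set(seeds)
        for name in active:
            available.update(reactions[name][1])
        # A reaction fails if any of its substrates is unavailable.
        failed = {name for name in active
                  if not set(reactions[name][0]) <= available}
        if not failed:
            return active
        active -= failed


def impact_degree(reactions, seeds, deleted):
    """Number of other reactions that fail when `deleted` is removed."""
    baseline = surviving(reactions, seeds)
    after = surviving(reactions, seeds, removed=deleted)
    return len(baseline - after - {deleted})


for name in reactions:
    print(name, impact_degree(reactions, seeds, name))
```

Deleting `R1` knocks out `R2` and `R3` in turn, whereas deleting `R3` has no downstream effect because `R4` still produces `pyr`; this is the kind of compensation by alternative pathways that keeps impact degrees small.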

17 citations


Journal ArticleDOI
TL;DR: This paper proposes a robust algorithm to find out rule groups to classify gene expression datasets and shows that the rule groups obtained by the algorithm have higher accuracy than that of other classification approaches.

14 citations


Proceedings ArticleDOI
01 Nov 2009
TL;DR: The idea of k-partite protein cliques suggests a novel approach to characterizing PPI networks, and may help function prediction for unknown proteins.
Abstract: We introduce a new topological concept called k-partite protein cliques to study protein interaction (PPI) networks. In particular, we examine functional coherence of proteins in k-partite protein cliques. A k-partite protein clique is a k-partite maximal clique comprising two or more nonoverlapping protein subsets between any two of which full interactions are exhibited. In the detection of PPI’s k-partite maximal cliques, we propose to transform PPI networks into induced K-partite graphs with proteins as vertices where edges only exist among the graph’s partites. Then, we present a k-partite maximal clique mining (MaCMik) algorithm to enumerate k-partite maximal cliques from K-partite graphs. Our MaCMik algorithm is applied to a yeast PPI network. We observe that there does exist interesting and unusually high functional coherence in k-partite protein cliques: most proteins in k-partite protein cliques, especially those in the same partites, share the same functions. Therefore, the idea of k-partite protein cliques suggests a novel approach to characterizing PPI networks, and may help function prediction for unknown proteins.

7 citations


Book ChapterDOI
23 Sep 2009
TL;DR: Experimental results show that the performance of the proposed descriptors is significantly better than other methods in the same category.
Abstract: In this paper, we have proposed a method for 2D image retrieval based on object shapes. The method relies on transforming the 2D images into 3D space based on distance transform. Spherical harmonics are obtained for the 3D data and used as descriptors for the underlying 2D images. The proposed method is compared against two existing methods which use spherical harmonics for shape based retrieval of images. MPEG-7 Still Images Content Set is used for performing experiments; this dataset consists of 3621 still images. Experimental results show that the performance of the proposed descriptors is significantly better than other methods in the same category.

7 citations


Proceedings ArticleDOI
19 Oct 2009
TL;DR: Results suggest the most effective means of breast cancer identification in the early stage is a hybrid approach.
Abstract: The goal of this research is to develop a computer aided diagnostic (CAD) system that can detect breast cancer in the early stage by using microarray and image data. We verified the performance of six well known classification algorithms with various performance matrices. Although we do not suggest a unique classifier algorithm for a CAD system, we do identify a number of algorithms whose performance is very promising. The algorithms' performance was validated on three image datasets, two of which are used for the first time in this experiment. Multidimensional image filtering is adopted for the final data extraction. The image data classification performance is compared with that on microarray data. Results suggest the most effective means of breast cancer identification in the early stage is a hybrid approach.

5 citations


Journal Article
TL;DR: An algorithm that adopts incremental feature combinations to effectively find the largest coverage is proposed, and the irrelevant coverage can be pruned away at early stages because potentially large coverage can be found earlier.
Abstract: Coverage is the range that covers only positive samples in attribute (or feature) space. Finding coverage is the kernel problem in induction algorithms because coverage can be used as rules to describe positive samples. To reflect the characteristics of the training samples, it is desirable to find large coverage that covers more positive samples. However, large coverage is difficult to find because the attribute space usually has very high dimensionality. Many heuristic methods such as ID3, AQ and CN2 have been proposed to find large coverage. A robust algorithm has also been proposed to find the largest coverage, but its time and space complexity become costly when the dimensionality is high. To overcome this drawback, this paper proposes an algorithm that adopts incremental feature combinations to effectively find the largest coverage. In this algorithm, irrelevant coverage can be pruned away at early stages because potentially large coverage can be found earlier. Experiments show that the space and time needed to find the largest coverage are significantly reduced.
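
The definition of coverage can be made concrete with a short sketch. It assumes that a candidate range is an axis-aligned box given as one closed interval per selected feature; the sample data and the names `inside` and `coverage_size` are hypothetical, and the sketch only checks the definition rather than reproducing the search or pruning procedure of the paper.

```python
# Hypothetical training samples in a two-feature attribute space.
positives = [{"x": 1, "y": 1}, {"x": 2, "y": 2}, {"x": 3, "y": 1}]
negatives = [{"x": 2, "y": 5}, {"x": 6, "y": 1}]


def inside(sample, box):
    """True if the sample lies within every interval of the box."""
    return all(low <= sample[feature] <= high
               for feature, (low, high) in box.items())


def coverage_size(box, positives, negatives):
    """Number of positives covered, or None if the box contains a negative sample."""
    if any(inside(sample, box) for sample in negatives):
        return None  # not a coverage: it must cover positive samples only
    return sum(inside(sample, box) for sample in positives)


# With feature x alone, the range also contains a negative sample.
print(coverage_size({"x": (1, 3)}, positives, negatives))
# Adding feature y to the combination excludes it and yields a valid coverage.
print(coverage_size({"x": (1, 3), "y": (1, 2)}, positives, negatives))
```

The second call shows why feature combinations matter: a range that is invalid on one feature can become a valid coverage once another feature is added.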

2 citations


Proceedings ArticleDOI
24 Apr 2009
TL;DR: An innovative data pre-processing approach to identify noise data in the data sets and eliminate or reduce the impact of the noise data on gene clustering, that makes the clustering results stable across clustering algorithms with different similarity metrics.
Abstract: The high-throughput experimental data from the new gene microarray technology has spurred numerous efforts to find effective ways of processing microarray data for revealing real biological relationships among genes. This work proposes an innovative data pre-processing approach to identify noise data in the data sets and eliminate or reduce the impact of the noise data on gene clustering. With the proposed algorithm, the pre-processed data sets make the clustering results stable across clustering algorithms with different similarity metrics, the important information of genes and features is kept, and the clustering quality is improved. The primary evaluation on real microarray data sets has shown the effectiveness of the proposed algorithm.

Book ChapterDOI
01 Sep 2009
TL;DR: This chapter introduces a state-of-the-art data mining technique, graph mining, which is good at defining and discovering interesting structural patterns in graphical data sets, and takes advantage of its expressive power to study protein structures.
Abstract: As one of the primary substances in a living organism, protein defines the character of each cell by interacting with the cellular environment to promote the cell’s growth and function [1]. Previous studies on proteomics indicate that the functions of different proteins could be assigned based upon protein structures [2,3]. Knowledge of protein structures gives us an overview of protein fold space and is helpful for understanding the evolutionary principles behind structure. By observing the architectures and topologies of the protein families, biological processes can be investigated more directly with much higher resolution and finer detail. For this reason, the analysis of protein, its structure and its interaction with other materials is emerging as an important problem in bioinformatics. However, the determination of protein structures is experimentally expensive and time consuming, which at present makes scientists largely dependent on sequence rather than more general structure to infer the function of a protein. For this reason, data mining technology is introduced into this area to provide more efficient data processing and knowledge discovery approaches. Unlike many data mining applications which lack available data, the protein structure determination problem and its interaction study, on the contrary, could utilize a vast amount of biologically relevant information on protein and its interaction, such as the protein data bank (PDB) [4], the structural classification of proteins (SCOP) databases [5], CATH databases [6], UniProt [7], and others. The difficulty of predicting protein structures, especially 3D structures, and the interactions between proteins as shown in Figure 6.1, lies in the computational complexity of the data.
Although a large number of approaches have been developed to determine protein structures, such as ab initio modelling [8], homology modelling [9] and threading [10], more efficient and reliable methods are still greatly needed. In this chapter, we will introduce a state-of-the-art data mining technique, graph mining, which is good at defining and discovering interesting structural patterns in graphical data sets, and take advantage of its expressive power to study protein structures, including protein structure prediction and comparison, and protein-protein interaction (PPI). The current graph pattern mining methods will be described, and typical algorithms will be presented, together with their applications in protein structure analysis. The rest of the chapter is organized as follows: Section 6.2 will give a brief introduction to the fundamental knowledge of protein, the publicly accessible protein data resources and the current research status of protein analysis; in Section 6.3, we will pay attention to one of the state-of-the-art data mining methods, graph mining; then Section 6.4 surveys several existing works on protein structure analysis using advanced graph mining methods in the recent decade; finally, in Section 6.5, a conclusion with potential further work will be summarized.