scispace - formally typeset
Search or ask a question

Showing papers by "Sushmita Paul published in 2013"


Journal ArticleDOI
TL;DR: An efficient method is proposed to select initial prototypes of different gene clusters, which enables the proposed c-means algorithm to converge to an optimum or near optimum solutions and helps to discover coexpressed gene clusters.
Abstract: Gene expression data clustering is one of the important tasks of functional genomics as it provides a powerful tool for studying functional relationships of genes in a biological process. Identifying coexpressed groups of genes represents the basic challenge in gene clustering problem. In this regard, a gene clustering algorithm, termed as robust rough-fuzzy $(c)$-means, is proposed judiciously integrating the merits of rough sets and fuzzy sets. While the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in cluster definition, the integration of probabilistic and possibilistic memberships of fuzzy sets enables efficient handling of overlapping partitions in noisy environment. The concept of possibilistic lower bound and probabilistic boundary of a cluster, introduced in robust rough-fuzzy $(c)$-means, enables efficient selection of gene clusters. An efficient method is proposed to select initial prototypes of different gene clusters, which enables the proposed $(c)$-means algorithm to converge to an optimum or near optimum solutions and helps to discover coexpressed gene clusters. The effectiveness of the algorithm, along with a comparison with other algorithms, is demonstrated both qualitatively and quantitatively on 14 yeast microarray data sets.

95 citations


Journal ArticleDOI
TL;DR: The proposed algorithm is robust in the sense that it can find overlapping and vaguely defined clusters with arbitrary shapes in noisy environment and is demonstrated on synthetic as well as coding and non-coding RNA expression data sets using some cluster validity indices.
Abstract: Cluster analysis is a technique that divides a given data set into a set of clusters in such a way that two objects from the same cluster are as similar as possible and the objects from different clusters are as dissimilar as possible. In this background, different rough-fuzzy clustering algorithms have been shown to be successful for finding overlapping and vaguely defined clusters. However, the crisp lower approximation of a cluster in existing rough-fuzzy clustering algorithms is usually assumed to be spherical in shape, which restricts to find arbitrary shapes of clusters. In this regard, this paper presents a new rough-fuzzy clustering algorithm, termed as robust rough-fuzzy c-means. Each cluster in the proposed clustering algorithm is represented by a set of three parameters, namely, cluster prototype, a possibilistic fuzzy lower approximation, and a probabilistic fuzzy boundary. The possibilistic lower approximation helps in discovering clusters of various shapes. The cluster prototype depends on the weighting average of the possibilistic lower approximation and probabilistic boundary. The proposed algorithm is robust in the sense that it can find overlapping and vaguely defined clusters with arbitrary shapes in noisy environment. An efficient method is presented, based on Pearson's correlation coefficient, to select initial prototypes of different clusters. A method is also introduced based on cluster validity index to identify optimum values of different parameters of the initialization method and the proposed clustering algorithm. The effectiveness of the proposed algorithm, along with a comparison with other clustering algorithms, is demonstrated on synthetic as well as coding and non-coding RNA expression data sets using some cluster validity indices.

17 citations


Journal ArticleDOI
TL;DR: The results on several microarray data sets demonstrate that the proposed method can bring a remarkable improvement on miRNA selection problem and is a potentially useful tool for exploration of miRNA expression data and identification of differentially expressed miRNAs worth further investigation.
Abstract: The miRNAs, a class of short approximately 22‐nucleotide non‐coding RNAs, often act post‐transcriptionally to inhibit mRNA expression. In effect, they control gene expression by targeting mRNA. They also help in carrying out normal functioning of a cell as they play an important role in various cellular processes. However, dysregulation of miRNAs is found to be a major cause of a disease. It has been demonstrated that miRNA expression is altered in many human cancers, suggesting that they may play an important role as disease biomarkers. Multiple reports have also noted the utility of miRNAs for the diagnosis of cancer. Among the large number of miRNAs present in a microarray data, a modest number might be sufficient to classify human cancers. Hence, the identification of differentially expressed miRNAs is an important problem particularly for the data sets with large number of miRNAs and small number of samples. In this regard, a new miRNA selection algorithm, called μHEM, is presented based on rough hypercuboid approach. It selects a set of miRNAs from a microarray data by maximizing both relevance and significance of the selected miRNAs. The degree of dependency of sample categories on miRNAs is defined, based on the concept of hypercuboid equivalence partition matrix, to measure both relevance and significance of miRNAs. The effectiveness of the new approach is demonstrated on six publicly available miRNA expression data sets using support vector machine. The.632+ bootstrap error estimate is used to minimize the variability and biasedness of the derived results. An important finding is that the μHEM algorithm achieves lowest B.632+ error rate of support vector machine with a reduced set of differentially expressed miRNAs on four expression data sets compare to some existing machine learning and statistical methods, while for other two data sets, the error rate of the μHEM algorithm is comparable with the existing techniques. The results on several microarray data sets demonstrate that the proposed method can bring a remarkable improvement on miRNA selection problem. The method is a potentially useful tool for exploration of miRNA expression data and identification of differentially expressed miRNAs worth further investigation.

14 citations


Book ChapterDOI
19 Dec 2013
TL;DR: The proposed method judiciously integrates the merits of robust rough-fuzzy c-means algorithm and normalized range-normalized city block distance to discover co-expressed miRNA clusters and helps to handle minute differences between two miRNA expression profiles.
Abstract: The microRNAs or miRNAs are short, endogenous RNAs having ability to regulate gene expression at the post-transcriptional level. Various studies have revealed that a large proportion of miRNAs are co-expressed. Expression profiling of miRNAs generates a huge volume of data. Complicated networks of miRNA-mRNA interaction increase the challenges of comprehending and interpreting the resulting mass of data. In this regard, this paper presents the application of city block distance in order to extract meaningful information from miRNA expression data. The proposed method judiciously integrates the merits of robust rough-fuzzy c-means algorithm and normalized range-normalized city block distance to discover co-expressed miRNA clusters. The city block distance is used to calculate the membership functions of fuzzy sets, and thereby helps to handle minute differences between two miRNA expression profiles. The effectiveness of the proposed approach, along with a comparison with other related methods, is demonstrated on several miRNA expression data sets using different cluster validity indices and gene ontology.

8 citations


Journal ArticleDOI
TL;DR: A novel approach for in silico identification of differentially expressed miRNAs from microarray expression data sets by integrating judiciously the theory of rough sets and merit of the so-called B.632+ bootstrap error estimate.
Abstract: The microRNAs, also known as miRNAs, are the class of small noncoding RNAs. They repress the expression of a gene posttranscriptionally. In effect, they regulate expression of a gene or protein. It has been observed that they play an important role in various cellular processes and thus help in carrying out normal functioning of a cell. However, dysregulation of miRNAs is found to be a major cause of a disease. Various studies have also shown the role of miRNAs in cancer and the utility of miRNAs for the diagnosis of cancer and other diseases. Unlike with mRNAs, a modest number of miRNAs might be sufficient to classify human cancers. However, the absence of a robust method to identify differentially expressed miRNAs makes this an open problem. In this regard, this paper presents a novel approach for in silico identification of differentially expressed miRNAs from microarray expression data sets. It integrates judiciously the theory of rough sets and merit of the so-called B.632+ bootstrap error estimate. While rough sets select relevant and significant miRNAs from expression data, the B.632+ error rate minimizes the variability and bias of the derived results. The effectiveness of the proposed approach, along with a comparison with other related approaches, is demonstrated on several miRNA microarray expression data sets, using the support vector machine.

7 citations


Book ChapterDOI
01 Jan 2013
TL;DR: The chapter reports on a rough set-based feature selection algorithm called maximum relevance-maximum significance (MRMS), and its applications on quantitative structure activity relationship (QSAR) and gene expression data.
Abstract: Feature selection is an important data pre-processing step in pattern recognition and data mining. It is effective in reducing dimensionality and redundancy among the selected features, and increasing the performance of learning algorithm and generating information-rich features subset. In this regard, the chapter reports on a rough set-based feature selection algorithm called maximum relevance-maximum significance (MRMS), and its applications on quantitative structure activity relationship (QSAR) and gene expression data. It selects a set of features from a high-dimensional data set by maximizing the relevance and significance of the selected features. A theoretical analysis is reported to justify the use of both relevance and significance criteria for selecting a reduced feature set with high predictive accuracy. The importance of rough set theory for computing both relevance and significance of the features is also established. The performance of the MRMS algorithm, along with a comparison with other related methods, is studied on three QSAR data sets using the R 2 statistic of support vector regression method, and on five cancer and two arthritis microarray data sets by using the predictive accuracy of the K-nearest neighbor rule and support vector machine.

3 citations