Showing papers by "Sanghamitra Bandyopadhyay" published in 2015


Journal ArticleDOI
TL;DR: A comprehensive and critical survey of the multitude of multiobjective evolutionary clustering techniques existing in the literature, classified according to the encoding strategies adopted, objective functions, evolutionary operators, strategy for maintaining nondominated solutions, and the method of selection of the final solution.
Abstract: Data clustering is a popular unsupervised data mining tool that is used for partitioning a given dataset into homogeneous groups based on some similarity/dissimilarity metric. Traditional clustering algorithms often make prior assumptions about the cluster structure and adopt a corresponding suitable objective function that is optimized either through classical techniques or metaheuristic approaches. These algorithms are known to perform poorly when the cluster assumptions do not hold in the data. Multiobjective clustering, in which multiple objective functions are simultaneously optimized, has emerged as an attractive and robust alternative in such situations. In particular, application of multiobjective evolutionary algorithms for clustering has become popular in the past decade because of their population-based nature. Here, we provide a comprehensive and critical survey of the multitude of multiobjective evolutionary clustering techniques existing in the literature. The techniques are classified according to the encoding strategies adopted, objective functions, evolutionary operators, strategy for maintaining nondominated solutions, and the method of selection of the final solution. The pros and cons of the different approaches are mentioned. Finally, we have discussed some real-life applications of multiobjective clustering in the domains of image segmentation, bioinformatics, web mining, and so forth.

132 citations
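
The survey above classifies techniques partly by how they maintain nondominated solutions. As a minimal sketch of that idea (not taken from any surveyed method), the following filter keeps only the candidate clusterings that no other candidate dominates. It assumes every objective is minimized; the `scores` values are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: all objectives are to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that are not dominated by any other point."""
    return [
        p
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical objective pairs for four candidate clusterings.
scores = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = nondominated(scores)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

This pairwise check is quadratic in the number of candidates, which is usually acceptable for the population sizes used in evolutionary clustering.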


Journal ArticleDOI
TL;DR: An algorithm for many-objective optimization problems, which will work more quickly than existing ones, while offering competitive performance, and a new form of elitism so as to restrict the number of higher ranked solutions that are selected in the next population is proposed.
Abstract: In this paper we have developed an algorithm for many-objective optimization problems, which will work more quickly than existing ones, while offering competitive performance. The algorithm periodically reorders the objectives based on their conflict status and selects a subset of conflicting objectives for further processing. We have taken differential evolution multiobjective optimization (DEMO) as the underlying metaheuristic evolutionary algorithm, and implemented the technique of selecting a subset of conflicting objectives using a correlation-based ordering of objectives. The resultant method is called $\alpha$-DEMO, where $\alpha$ is a parameter determining the number of conflicting objectives to be selected. We have also proposed a new form of elitism so as to restrict the number of higher ranked solutions that are selected in the next population. The $\alpha$-DEMO with the revised elitism is referred to as $\alpha$-DEMO-revised. Extensive results on the five DTLZ functions show that the number of objective computations required by the proposed algorithm is much lower than for the existing algorithms, while the convergence measures are competitive or often better. Statistical significance testing is also performed. A real-life application to structural optimization of a factory shed truss is demonstrated.

89 citations
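
The abstract above mentions a correlation-based ordering of objectives but does not give the exact rule. The sketch below is therefore only one plausible reading, not the published procedure. It scores each objective by the sum of `1 - r` over its Pearson correlations with the other objectives and keeps the `alpha` highest-scoring ones. The scoring rule, the function names, and the sample data are all assumptions.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def most_conflicting(objs: List[List[float]], alpha: int) -> List[int]:
    """Return indices of the alpha objectives with the highest conflict score.

    objs[k] holds the values of objective k across the current population.
    Conflict score (an assumption): sum over other objectives of (1 - r),
    so strongly negatively correlated objectives score highest.
    """
    m = len(objs)
    conflict = [
        sum(1.0 - pearson(objs[i], objs[j]) for j in range(m) if j != i)
        for i in range(m)
    ]
    order = sorted(range(m), key=lambda i: conflict[i], reverse=True)
    return sorted(order[:alpha])


# Hypothetical population of four solutions and three objectives:
# f1 moves with f0, while f2 opposes both.
objs = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
selected = most_conflicting(objs, 1)
```

In a periodic scheme like the one described, such a selection would be recomputed every few generations, since correlations between objectives change as the population converges.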


Journal ArticleDOI
TL;DR: This work presents a novel machine learning based approach, MBSTAR (Multiple instance learning of Binding Sites of miRNA TARgets), for accurate prediction of true or functional miRNA binding sites and predicts target mRNAs with highest accuracy.
Abstract: MicroRNA (miRNA) regulates gene expression by binding to specific sites in the 3′ untranslated regions of its target genes. Machine learning based miRNA target prediction algorithms first extract a set of features from potential binding sites (PBSs) in the mRNA and then train a classifier to distinguish targets from non-targets. However, they do not consider whether the PBSs are functional or not, and consequently result in high false positive rates. This substantially affects the follow-up functional validation by experiments. We present a novel machine learning based approach, MBSTAR (Multiple instance learning of Binding Sites of miRNA TARgets), for accurate prediction of true or functional miRNA binding sites. A multiple instance learning framework is adopted to handle the lack of information about the actual binding sites in the target mRNAs. Biologically validated 9531 interacting and 973 non-interacting miRNA-mRNA pairs are identified from Tarbase 6.0 and confirmed with the PAR-CLIP dataset. It is found that MBSTAR achieves the highest number of binding sites overlapping with PAR-CLIP, with a maximum F-score of 0.337. Compared to the other methods, MBSTAR also predicts target mRNAs with the highest accuracy. The tool and genome-wide predictions are available at http://www.isical.ac.in/~bioinfo_miu/MBStar30.htm.

65 citations


Journal ArticleDOI
TL;DR: The experimental results confirm the superiority of the proposed algorithm over the other state-of-the-art unsupervised feature selection algorithms on eight different kinds of datasets with varying numbers of points and dimensions.
Abstract: In this article, an unsupervised feature selection algorithm is proposed using an improved version of a recently developed Differential Evolution technique called MoDE. The proposed algorithm produces an optimal feature subset while optimizing three criteria, namely, the average standard deviation of the selected feature subset, the average dissimilarity of the selected features, and the average similarity of non-selected features with respect to their first nearest neighbor selected features. Normalized mutual information score is employed for computing both the similarity as well as the dissimilarity measures. The experimental results confirm the superiority of the proposed algorithm over the other state-of-the-art unsupervised feature selection algorithms for eight different kinds of datasets with the number of points ranging from 80 to 6238 and the number of dimensions ranging from 30 to 649.

50 citations


Journal ArticleDOI
TL;DR: This study identifies selectively expressed mRNAs in Nef-expressing U937 cells and their exosomes and supports a new mode of intercellular regulation by the HIV-1 Nef protein.
Abstract: The Nef protein of human immunodeficiency virus (HIV) promotes viral replication and progression to AIDS. Besides its well-studied effects on intracellular signaling, Nef also functions through its secretion in exosomes, which are nanovesicles containing proteins, microRNAs, and mRNAs and are important for intercellular communication. Nef expression enhances exosome secretion and these exosomes can enter uninfected CD4 T cells leading to apoptotic death. We have recently reported the first miRNome analysis of exosomes secreted from Nef-expressing U937 monocytic cells. Here we show genome-wide transcriptome analysis of Nef-expressing U937 cells and their exosomes. We identified four key mRNAs preferentially retained in Nef-expressing cells; these code for MECP2, HMOX1, AARSD1, and ATF2 and are important for chromatin modification and gene expression. Interestingly, their target miRNAs are exported out in exosomes. We also identified three key mRNAs selectively secreted in exosomes from Nef-expressing U937 cells, with their corresponding miRNAs being preferentially retained in cells. These are AATK, SLC27A1, and CDKAL and are important in apoptosis and fatty acid transport. Thus, our study identifies selectively expressed mRNAs in Nef-expressing U937 cells and their exosomes and supports a new mode of intercellular regulation by the HIV-1 Nef protein.

36 citations


Journal ArticleDOI
10 Feb 2015-Gene
TL;DR: The experimental results demonstrate that miRNAs carry a unique signature that distinguishes cancer sub types and reveal new cancer subtypes, and additional survival analyses based on clinical data also strengthen this claim.

35 citations


Journal ArticleDOI
TL;DR: A comparative assessment of these studies and some methodologies for discussing the implication of their results are presented, and different computational techniques for predicting HIV-1-human PPIs are reviewed and a comparative study of their applicability is provided.
Abstract: The computational or in silico approaches for analysing the HIV-1-human protein-protein interaction (PPI) network, predicting different host cellular factors and PPIs and discovering several pathways are gaining popularity in the field of HIV research. Although there exist quite a few studies in this regard, no previous effort has been made to review these works in a comprehensive manner. Here we review the computational approaches that are devoted to the analysis and prediction of HIV-1-human PPIs. We have broadly categorized these studies into two fields: computational analysis of HIV-1-human PPI network and prediction of novel PPIs. We have also presented a comparative assessment of these studies and proposed some methodologies for discussing the implication of their results. We have also reviewed different computational techniques for predicting HIV-1-human PPIs and provided a comparative study of their applicability. We believe that our effort will provide helpful insights to the HIV research community.

35 citations


Journal ArticleDOI
01 Apr 2015-PLOS ONE
TL;DR: A computational rule mining framework, StatBicRM, to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets, which performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets.
Abstract: Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify a special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy is used to eliminate the insignificant/low-significant/redundant genes in such a way that the significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to determine how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining the epigenetic effect (viz., effect of methylation) on gene expression level.

31 citations


Journal ArticleDOI
TL;DR: Fast Overlapped Community Search (FOCS), an algorithm that accounts for local connectedness in order to identify overlapped communities, is proposed and is shown to be linear in number of edges and nodes.
Abstract: Discovery of natural groups of similarly functioning individuals is a key task in the analysis of real-world networks. Also, overlap between community pairs is commonplace, particularly in large social and biological graphs. In fact, overlaps between communities are known to be denser than the non-overlapped regions of the communities. However, most of the existing algorithms that detect overlapping communities assume that the communities are denser than their surrounding regions, and falsely identify overlaps as communities. Further, many of these algorithms are computationally demanding and thus do not scale reasonably with varying network sizes. In this article, we propose Fast Overlapped Community Search (FOCS), an algorithm that accounts for local connectedness in order to identify overlapped communities. FOCS is shown to be linear in the number of edges and nodes. It additionally gains in speed via simultaneous selection of multiple near-best communities, rather than merely the best, at each iteration. FOCS outperforms some popular overlapped community finding algorithms in terms of computational time while not compromising on quality.

30 citations


Journal ArticleDOI
07 Apr 2015-RNA
TL;DR: It is revealed that the reduced expression of miR-145-5p/-3p pair potentially contributes to elevated expression of genes in the "FOXM1 transcription factor network" pathway, which may consequently lead to uncontrolled cell proliferation.
Abstract: A precursor microRNA (miRNA) has two arms: miR-5p and miR-3p (miR-5p/-3p). Depending on the tissue or cell types, both arms can become functional. However, little is known about their coregulatory mechanisms during the tumorigenic process. Here, by using the large-scale miRNA expression profiles of five cancer types, we revealed that several miR-5p/-3p arms were concordantly dysregulated in each cancer. To explore possible coregulatory mechanisms of concordantly dysregulated miR-5p/-3p pairs, we developed a robust computational framework and applied it to lung cancer data. The framework deciphers miR-5p/-3p coregulated protein interaction networks critical to lung cancer development. As a novel part of the method, we applied the second-order partial correlation to minimize false-positive regulations. Using 279 matched miRNA and mRNA expression profiles extracted from tumor and normal lung tissue samples, we identified 17 aberrantly expressed miR-5p/-3p pairs that potentially modulate the gene expression of 35 protein complexes. Functional analyses revealed that these complexes are associated with cancer-related biological processes, suggesting the oncogenic potential of the reported miR-5p/-3p pairs. Specifically, we revealed that the reduced expression of the miR-145-5p/-3p pair potentially contributes to elevated expression of genes in the "FOXM1 transcription factor network" pathway, which may consequently lead to uncontrolled cell proliferation. Subsequently, the regulation of miR-145-5p/-3p in the FOXM1 signaling pathway was validated by a cohort of 104 matched miRNA and protein (reverse-phase protein array) expression profiles in lung cancer. In summary, our computational framework provides a novel tool to study miR-5p/-3p coregulatory mechanisms in cancer and other diseases.

28 citations
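
The abstract above relies on second-order partial correlation to reduce false-positive regulations. The sketch below shows the standard recursive textbook formula: the correlation between `x` and `y` after controlling for two other variables `z` and `w`. Which variables the framework actually conditions on is not stated in the abstract, so the variable roles and the sample data here are hypothetical.

```python
from math import sqrt
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


def partial_step(r_ab: float, r_ac: float, r_bc: float) -> float:
    """Remove the effect of one control c from the correlation of a and b."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def second_order_partial(x, y, z, w) -> float:
    """Correlation of x and y controlling for both z and w."""
    xy_z = partial_step(pearson(x, y), pearson(x, z), pearson(y, z))
    xw_z = partial_step(pearson(x, w), pearson(x, z), pearson(w, z))
    yw_z = partial_step(pearson(y, w), pearson(y, z), pearson(w, z))
    return partial_step(xy_z, xw_z, yw_z)


# Hypothetical profiles: x and y share only the common drivers z and w,
# plus mutually orthogonal noise terms e1 and e2.
z = [1, -1, 1, -1, 1, -1, 1, -1]
w = [1, 1, -1, -1, 1, 1, -1, -1]
e1 = [1, 1, 1, 1, -1, -1, -1, -1]
e2 = [1, -1, -1, 1, 1, -1, -1, 1]
x = [a + b + c for a, b, c in zip(z, w, e1)]
y = [a + b + c for a, b, c in zip(z, w, e2)]

raw = pearson(x, y)                          # sizeable, driven by z and w
adjusted = second_order_partial(x, y, z, w)  # close to zero
```

In this constructed case the raw correlation suggests an association between `x` and `y`, while the adjusted value shows that it is fully explained by the two controls.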


Journal ArticleDOI
01 Apr 2015
TL;DR: A new framework based on multiobjective optimization (MOO), namely FeaClusMOO, is proposed which is capable of identifying the correct partitioning as well as the most relevant set of features from a data set.
Abstract: In this paper a new framework based on multiobjective optimization (MOO), namely FeaClusMOO, is proposed which is capable of identifying the correct partitioning as well as the most relevant set of features from a data set. A newly developed multiobjective simulated annealing based optimization technique, namely archived multiobjective simulated annealing (AMOSA), is used as the background strategy for optimization. Here features and cluster centers are encoded in the form of a string. As the objective functions, two internal cluster validity indices measuring the goodness of the obtained partitioning using Euclidean distance and point symmetry based distance, respectively, and a count of the number of features are utilized. These three objectives are optimized simultaneously using AMOSA in order to detect the appropriate subset of features, the appropriate number of clusters, and the appropriate partitioning. Points are allocated to different clusters using a point symmetry based distance. Mutation changes the feature combination as well as the set of cluster centers. Since AMOSA, like any other MOO technique, provides a set of solutions on the final Pareto front, a technique based on the concept of semi-supervised classification is developed to select a solution from the given set. The effectiveness of the proposed FeaClusMOO is shown for seven higher dimensional real-life data sets in comparison with other clustering techniques: its Euclidean distance based version, where Euclidean distance is used for cluster assignment; a genetic algorithm based automatic clustering technique (VGAPS-clustering) using point symmetry based distance with all the features; and the K-means clustering technique with all features.

01 Jan 2015
TL;DR: A general approach is proposed to estimate the number of occupants in a zone using different kinds of measurements such as motion detection, power consumption or CO2 concentration using a C4.5 learning algorithm that yields human readable decision trees.
Abstract: A general approach is proposed to estimate the number of occupants in a zone using different kinds of measurements such as motion detection, power consumption or CO2 concentration. The proposed approach is inspired from machine learning. It starts by determining among different measurements those that are the most useful by calculating the information gains. Then, an estimation algorithm is proposed. It relies on a C4.5 learning algorithm that yields human readable decision trees using measurements to estimate the number of occupants. It has been applied to an office setting.
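
The first step described above, ranking measurements by information gain, can be sketched as follows. The sketch assumes the measurements have already been discretized into categories; the sensor names and readings are hypothetical and not taken from the office study.

```python
from collections import Counter
from math import log2
from typing import Hashable, List


def entropy(labels: List[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature: List[Hashable], labels: List[Hashable]) -> float:
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical discretized readings and occupant counts for four time slots.
occupants = [0, 0, 1, 1]
motion = ["no", "no", "yes", "yes"]   # perfectly informative here
co2 = ["low", "high", "low", "high"]  # uninformative here

gains = {
    "motion": information_gain(motion, occupants),
    "co2": information_gain(co2, occupants),
}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Measurements with the highest gain would be kept as inputs to the decision tree learner. Note that C4.5 itself selects splits using gain ratio, a normalized variant of this quantity.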

Journal ArticleDOI
TL;DR: The task of identifying protein complexes is presented as a multiobjective optimization problem, and the identified protein complexes are found to be associated with several disorder classes like ‘Cancer’, ‘Endocrine’ and ‘Multiple’.
Abstract: Detecting protein complexes within protein–protein interaction (PPI) networks is a major step toward the analysis of biological processes and pathways. Identification and characterization of protein complexes in a PPI network is an ongoing challenge. Several high-throughput experimental techniques provide a substantial number of PPIs, which are widely utilized for compiling the PPI network of a species. Here we focus on detecting human protein complexes by developing a multiobjective framework. For this, a large human PPI network is partitioned into modules, which serve as protein complexes. For building the objective functions we have utilized topological properties of the PPI network and biological properties based on Gene Ontology semantic similarity. The proposed method is compared with some state-of-the-art algorithms in the context of different performance metrics. For the purpose of biological validation of our predicted complexes we have also employed a Gene Ontology and pathway based analysis. Additionally, we have performed an analysis to associate the resulting protein complexes with 22 key disease classes. Two bipartite networks are created to clearly visualize the association of identified protein complexes with the disorder classes. Here, we present the task of identifying protein complexes as a multiobjective optimization problem. Identified protein complexes are found to be associated with several disorder classes like ‘Cancer’, ‘Endocrine’ and ‘Multiple’. This analysis uncovers some new relationships between disorders and predicted complexes that may play a role in the prediction of multi-target drugs.

Proceedings ArticleDOI
24 Sep 2015
TL;DR: A general approach is proposed to estimate the number of occupants in a zone using different kinds of measurements such as motion detection, power consumption or CO2 concentration using a C4.5 learning algorithm that yields human readable decision trees.
Abstract: A general approach is proposed to estimate the number of occupants in a zone using different kinds of measurements such as motion detection, power consumption or CO2 concentration. The proposed approach is inspired from machine learning. It starts by determining among different measurements those that are the most useful by calculating the information gains. Then, an estimation algorithm is proposed. It relies on a C4.5 learning algorithm that yields human readable decision trees using measurements to estimate the number of occupants. It has been applied to an office setting.

Journal ArticleDOI
TL;DR: Pharmacophore-based virtual screening, subsequent docking, and molecular dynamics simulations have been done to identify potential inhibitors of maltosyl transferase of Mycobacterium tuberculosis (mtb GlgE) and have confirmed stable protein ligand binding.
Abstract: Pharmacophore-based virtual screening, subsequent docking, and molecular dynamics (MD) simulations have been done to identify potential inhibitors of maltosyl transferase of Mycobacterium tuberculo...

Journal ArticleDOI
TL;DR: PBE based AMOSA is found to comprehensively outperform AMOSA, MOEA/D-DE, the conventional ?


Proceedings ArticleDOI
01 Aug 2015
TL;DR: Experiments on drug discovery and image datasets show that the performance of the proposed algorithm (MI-FCKNN) is better than the traditional citation-kNN and competitive with most state-of-the-art algorithms.
Abstract: In the multiple instance learning (MIL) setting, instances are grouped together in different labeled bags and the classifier tries to learn the label of unknown bags or instances. This is significantly different from traditional supervised learning techniques, where the instances themselves are labeled. In this work, a fuzzy based citation-kNN technique, which uses a modified Hausdorff distance between bags, is introduced. Introduction of a fuzzy distance measure helps to solve the problem of overlapping bags. The effect of false positive instances in a positive bag is also reduced by calculating a fuzzy class membership for the training bags. Experiments on drug discovery and image datasets show that the performance of the proposed algorithm (MI-FCKNN) is better than the traditional citation-kNN and competitive with most state-of-the-art algorithms.
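
Bag-level nearest-neighbor methods such as citation-kNN need a distance between sets of instances. The abstract does not spell out its modified Hausdorff distance or its fuzzy extension, so the sketch below shows only two common crisp variants as a point of reference: the classical (maximal) Hausdorff distance and the minimal Hausdorff distance. The bags `A` and `B` are hypothetical.

```python
from math import dist  # Euclidean distance between two points, Python 3.8+
from typing import List, Sequence

Bag = List[Sequence[float]]


def directed_hausdorff(A: Bag, B: Bag) -> float:
    """Largest distance from any instance of A to its nearest instance in B."""
    return max(min(dist(a, b) for b in B) for a in A)


def hausdorff(A: Bag, B: Bag) -> float:
    """Classical symmetric Hausdorff distance between two bags."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


def minimal_hausdorff(A: Bag, B: Bag) -> float:
    """Distance between the closest pair of instances across the two bags."""
    return min(dist(a, b) for a in A for b in B)


# Two hypothetical bags of 2-D instances that share one instance.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(1.0, 0.0), (4.0, 0.0)]

d_max = hausdorff(A, B)          # driven by the outlying instance (4, 0)
d_min = minimal_hausdorff(A, B)  # zero, because the bags share (1, 0)
```

The gap between the two values illustrates why the choice of variant matters: the maximal form is sensitive to a single outlying instance, while the minimal form treats any two bags with a shared instance as identical.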

Journal ArticleDOI
TL;DR: The prediction of sequence conservation, molecular phylogenetics, protein–protein network and the association between the MODY cascades enhances opportunities to get more insights into the less-known MODY disease.
Abstract: Maturity onset diabetes of the young (MODY) is a metabolic and genetic disorder. It is different from type 1 and type 2 diabetes, with a low occurrence level (1–2 %) among all diabetes. This disorder is a consequence of β-cell dysfunction. To date, 11 subtypes of MODY have been identified, and all of them can cause gene mutations. However, very little is known about the gene mapping, molecular phylogenetics, and co-expression among MODY genes and networking between cascades. This study has used the latest servers and software such as VarioWatch, ClustalW, MUSCLE, G Blocks, Phylogeny.fr, iTOL, WebLogo, STRING, and KEGG PATHWAY to perform comprehensive analyses of gene mapping, multiple sequence alignment, molecular phylogenetics, protein–protein network design, co-expression analysis of MODY genes, and pathway development. The MODY genes are located on chromosomes 2, 7, 8, 9, 11, 12, 13, 17, and 20. The highly aligned block shows that Pro, Gly, Leu, Arg, and Pro residues are highly aligned in the positions of 296, 386, 437, 455, 456 and 598, respectively. Alignment scores indicate that HNF1A and HNF1B proteins show high sequence similarity among MODY proteins. Protein–protein network design shows that HNF1A, HNF1B, HNF4A, NEUROD1, PDX1, PAX4, INS, and GCK are strongly connected, and the co-expression analyses between MODY genes also show a distinct association between the HNF1A and HNF4A genes. This study has used the latest bioinformatics tools to develop a rapid method to assess the evolutionary relationship, the network development, and the associations among eleven MODY genes and cascades. The prediction of sequence conservation, molecular phylogenetics, protein–protein network and the association between the MODY cascades enhances opportunities to get more insights into the less-known MODY disease.

Journal ArticleDOI
TL;DR: This paper proposes a stacked neural network model for finding out the largest quasi-complete module (core) in weighted graphs and shows the effectiveness of the proposed approach on DIMACS graphs.

Proceedings ArticleDOI
29 Oct 2015
TL;DR: The method discussed in this article identifies unique and high number of mutual connections through weighted self-information, and the extracted skeleton network has a highly reduced number of edges while still conserving the centrality distributions as far as possible.
Abstract: On-line social networks mostly allow individuals to extend friend requests to all forms of possible connections, including those related to official purposes, interests, family relations, friendships, and acquaintances. One needs to mine relevant connections in order to make reliable and meaningful interpretations from network analysis. Most networks lack weight assignments that mark the strength of a connection. Thus there is a need for methods that can effectively identify essential edges from only the topological information available. The method discussed in this article identifies unique and high number of mutual connections through weighted self-information. The extracted skeleton network has a highly reduced number of edges while still conserving the centrality distributions as far as possible. The method is applied locally to each node, to extract connections relevant to every node. Results are demonstrated on five datasets, which show that the proposed method is able to eliminate a large number of irrelevant edges. The method is also found to scale well to large datasets.

Proceedings ArticleDOI
02 Mar 2015
TL;DR: Two syntactic patterns namely phrase structure and dependency structure are explored to produce improved results with respect to the Cancer Genetics Data provided in the BioNLP'13 Shared Task.
Abstract: This paper attempts to employ a learning based pattern classification technique to extract events from biological literature. Although various approaches to extract events have been explored, none is suitable for designing a practical system of event extraction. Extracting events more precisely is still an ongoing process. In this paper, new features that seem to be relevant for the given task are investigated. Two syntactic patterns, namely phrase structure and dependency structure, are explored to produce improved results with respect to the Cancer Genetics Data provided in the BioNLP'13 Shared Task. Conditional probability scores from a stacked model are also considered as features. The patterns and the probability scores, along with some other linguistic features, are fed to SVMs to train them for the task of bio-event extraction from natural language articles. The results are compared with the performance of the best extraction system in the Cancer Genetics Task.

Proceedings Article
01 Jan 2015
TL;DR: This book constitutes the proceedings of the 6th International Conference on Pattern Recognition and Machine Intelligence, PReMI 2015, held in Warsaw, Poland, in June/July 2015.
Abstract: This book constitutes the proceedings of the 6th International Conference on Pattern Recognition and Machine Intelligence, PReMI 2015, held in Warsaw, Poland, in June/July 2015. The total of 53 full papers and 1 short paper presented in this volume were carefully reviewed and selected from 90 submissions. They were organized in topical sections named: foundations of machine learning; image processing; image retrieval; image tracking; pattern recognition; data mining techniques for large scale data; fuzzy computing; rough sets; bioinformatics; and applications of artificial intelligence.

Journal ArticleDOI
TL;DR: An unsupervised feature selection technique is proposed, using maximum information compression index as the dissimilarity measure and the well-known density-based cluster identification technique DBSCAN for identifying the largest natural group of dissimilar features.
Abstract: Reduction of dimensionality has emerged as a routine process in modelling complex biological systems. A large number of feature selection techniques have been reported in the literature to improve model performance in terms of accuracy and speed. In the present article an unsupervised feature selection technique is proposed, using maximum information compression index as the dissimilarity measure and the well-known density-based cluster identification technique DBSCAN for identifying the largest natural group of dissimilar features. The algorithm is fast and less sensitive to the user-supplied parameters. Moreover, the method automatically determines the required number of features and identifies them. We used the proposed method for reducing dimensionality of a number of benchmark data sets of varying sizes. Its performance was also extensively compared with some other well-known feature selection methods.
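
The dissimilarity measure named above, the maximum information compression index, is the smallest eigenvalue of the 2×2 covariance matrix of a pair of features. It is zero when the two features are linearly dependent and grows as they become less redundant. The sketch below computes it in closed form. Population (divide-by-n) variances are an assumption, the sample vectors are hypothetical, and the DBSCAN grouping step of the method is not shown.

```python
from math import sqrt
from typing import List


def mici(x: List[float], y: List[float]) -> float:
    """Maximum information compression index of two features.

    Computed as the smaller eigenvalue of the covariance matrix
    [[var_x, cov], [cov, var_y]], using population (1/n) moments.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Closed-form smaller eigenvalue of a symmetric 2x2 matrix.
    return 0.5 * (var_x + var_y - sqrt((var_x - var_y) ** 2 + 4 * cov ** 2))


# Hypothetical features: the second is a scaled copy of the first (redundant),
# while the last two are uncorrelated with unit variance.
redundant = mici([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
unrelated = mici([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

Because the index depends on the feature variances, it is not scale-invariant, so features are commonly normalized before pairwise values are compared.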