Author

Rahul Chowdhury

Bio: Rahul Chowdhury is an academic researcher from VIT University. The author has contributed to research in the topics of Cluster analysis and Rough set. The author has an h-index of 2 and has co-authored 7 publications receiving 13 citations.

Papers
Journal ArticleDOI
TL;DR: Improvements are made and an algorithm is conceptualized, called MMeMeR or Min-Mean-Mean-Roughness, which takes care of uncertainty and also handles heterogeneous data.
Abstract: In recent times innumerable clustering algorithms have been developed whose main function is to form sets of objects having almost the same features. But due to the presence of categorical data values, these algorithms face a challenge in their implementation. Also, some algorithms which are able to take care of categorical data are not able to process uncertainty in the values and so have stability issues. Thus, handling categorical data along with uncertainty has become necessary owing to such difficulties. So, in 2007 the MMR algorithm was developed, which was based on basic rough set theory. MMeR was proposed in 2009, which surpassed the results of MMR in taking care of categorical data, and it could also handle heterogeneous values. SDR and SSDR were postulated in 2011 and were able to handle hybrid data; these two showed more accuracy when compared to MMR and MMeR. In this paper, we make further improvements and conceptualize an algorithm, which we call MMeMeR or Min-Mean-Mean-Roughness. It takes care of uncertainty and also handles heterogeneous data. Standard data sets have been used to gauge its effectiveness over the other methods.
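To make the rough-set machinery behind this family of algorithms concrete, the sketch below (Python, invented toy data) computes the roughness of one categorical attribute's value sets with respect to another attribute, the kind of quantity that MMR-style methods minimize when choosing a splitting attribute. It illustrates the generic roughness measure only, not the full MMeMeR procedure; all names and data are assumptions for illustration.

from collections import defaultdict

def equivalence_classes(objects, attr):
    """Group object indices by their value on one categorical attribute."""
    classes = defaultdict(set)
    for idx, obj in enumerate(objects):
        classes[obj[attr]].add(idx)
    return list(classes.values())

def roughness(target_set, objects, attr):
    """Rough-set roughness of target_set w.r.t. the partition induced by attr:
    1 - |lower approximation| / |upper approximation|."""
    lower, upper = set(), set()
    for eq in equivalence_classes(objects, attr):
        if eq <= target_set:
            lower |= eq        # equivalence class fully contained in the target set
        if eq & target_set:
            upper |= eq        # equivalence class that overlaps the target set
    return 1.0 - len(lower) / len(upper) if upper else 0.0

def mean_roughness(objects, split_attr, other_attr):
    """Average roughness of split_attr's value sets w.r.t. other_attr; MMR-style
    methods choose the splitting attribute that minimizes such averages."""
    classes = equivalence_classes(objects, split_attr)
    return sum(roughness(c, objects, other_attr) for c in classes) / len(classes)

# invented categorical objects (attribute -> value)
data = [
    {"colour": "red",  "size": "big",   "shape": "round"},
    {"colour": "red",  "size": "small", "shape": "round"},
    {"colour": "blue", "size": "big",   "shape": "square"},
    {"colour": "blue", "size": "small", "shape": "square"},
]
print(mean_roughness(data, "colour", "shape"))  # 0.0: shape determines colour crisply
print(mean_roughness(data, "colour", "size"))   # 1.0: size says nothing about colour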

8 citations

Book ChapterDOI
01 Jan 2017
TL;DR: This research work explains the process of data collection from Twitter, the preprocessing of the data, building a model to fit the data, and comparing the accuracy of support vector machines and the Naive Bayes algorithm for text classification, and it states the reason for the superiority of support vector machines over the Naive Bayes algorithm.
Abstract: The Zika disease is a 2015–16 virus epidemic and continues to be a global health issue. The recent trend of sharing critical information on social networks such as Twitter has been a motivation for us to propose a classification model that classifies tweets related to Zika and thus enables us to extract helpful insights into the community. In this paper, we explain the process of data collection from Twitter, the preprocessing of the data, building a model to fit the data, and comparing the accuracy of support vector machines and the Naive Bayes algorithm for text classification, and we state the reason for the superiority of support vector machines over the Naive Bayes algorithm. Useful analytical tools such as word clouds are also presented in this research work to provide a more sophisticated method to retrieve community support from social networks such as Twitter.
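As an illustration of the kind of pipeline the abstract describes, the sketch below compares a linear SVM and multinomial Naive Bayes on TF-IDF features using scikit-learn. The tweets, labels, and preprocessing here are invented stand-ins, not the authors' dataset or exact setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# invented labelled tweets: 1 = related to Zika, 0 = unrelated
tweets = [
    "zika virus case confirmed in the city",
    "new zika prevention guidelines issued today",
    "mosquito control teams fighting zika outbreak",
    "great football match tonight with friends",
    "coffee and a good book this morning",
    "traffic is terrible on the highway again",
]
labels = [1, 1, 1, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.5, stratify=labels, random_state=0)

for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    model = make_pipeline(TfidfVectorizer(stop_words="english"), clf)  # TF-IDF features
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))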

7 citations

Proceedings ArticleDOI
10 Nov 2016
TL;DR: Bagging, boosting, and voting ensemble learning have been used to improve the precision rate, i.e. the accuracy of classification, and class association rules are applied to see whether they perform better than collaborative filtering for suggesting items to the user.
Abstract: Cars are an essential part of our everyday life. Nowadays we have a wide range of cars produced by a number of companies in all segments. The buyer has to consider a lot of factors while buying a car, which makes the whole process a lot more difficult. So in this paper we have developed an ensemble learning method to aid people in making the decision. Bagging, boosting, and voting ensemble learning have been used to improve the precision rate, i.e. the accuracy of classification. We have also applied class association rules to see whether they perform better than collaborative filtering for suggesting items to the user.
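The sketch below shows the three ensemble strategies named in the abstract (bagging, boosting, and voting) in scikit-learn, scored with cross-validation. The Iris dataset is used purely as a stand-in; the paper works with car data, and its base learners and settings are not specified in the abstract.

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # stand-in dataset; the paper uses car data

ensembles = {
    # bagging: many decision trees on bootstrap samples, combined by majority vote
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    # boosting: weak learners trained sequentially, each focusing on previous errors
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # voting: heterogeneous classifiers combined by majority (hard) vote
    "voting": VotingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("nb", GaussianNB()),
                    ("knn", KNeighborsClassifier())],
        voting="hard"),
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)     # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")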

2 citations

Proceedings ArticleDOI
01 Jun 2017
TL;DR: A generic overview of the role of regular expressions in big data analytics is given, and a novel regex matching architecture that supports advanced matching capabilities in a resource-efficient manner is presented.
Abstract: Text analytics systems, such as IBM's SystemT software, rely on regular expressions (regexes) and dictionaries for transforming unstructured data into a structured form. Unlike network intrusion detection systems, text analytics systems compute and report exactly where the specific and sensitive information starts and ends in a text document. Therefore, advanced regex matching capabilities, such as start offset reporting, capturing groups, and leftmost match computation, are heavily used in text analytics systems. There is also a novel regex matching architecture that supports such capabilities in a resource-efficient manner. The resource efficiency is achieved by 1) eliminating state replication, 2) avoiding expensive offset comparison operations in leftmost match computation, and 3) minimizing the number of offset registers. Experiments on regex sets from the text analytics and network intrusion detection domains, using an Altera Stratix IV FPGA, show that the proposed architecture achieves a more than three-fold reduction in the logic resources used and a more than 1.25-fold increase in clock frequency with respect to a recently proposed architecture that supports the same features. The paper also gives a generic overview of the role of regular expressions in big data analytics.
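The hardware architecture itself cannot be reproduced here, but the three regex capabilities it accelerates are easy to illustrate in software. The short sketch below, using Python's re module on an invented string, shows leftmost matching, start/end offset reporting, and capturing groups.

import re

text = "card 4111-1111-1111-1111 issued; backup card 5500-0000-0000-0004 on file"
pattern = re.compile(r"(\d{4})-(\d{4})-(\d{4})-(\d{4})")   # four capturing groups

for m in pattern.finditer(text):                 # matches are reported leftmost-first
    print("match at offsets", m.start(), "-", m.end())           # start/end offset reporting
    print("first block:", m.group(1), "last block:", m.group(4))  # capturing groups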

2 citations

Proceedings ArticleDOI
01 Mar 2017
TL;DR: The MMeR algorithm is generalized with neighbourhood relations into a neighbourhood rough set model called MMeNR (Min Mean Neighborhood Roughness), which takes care of heterogeneous data and also the uncertainty associated with it.
Abstract: In recent times innumerable clustering algorithms have been developed whose main function is to form sets of objects having almost the same features. But due to the presence of categorical data values, these algorithms face a challenge in their implementation. Also, some algorithms which are able to take care of categorical data are not able to process uncertainty in the values and so have stability issues. Thus, handling categorical data along with uncertainty has become necessary owing to such difficulties. So, in 2007 the MMR [1] algorithm was developed, which was based on basic rough set theory. MMeR [2] was proposed in 2009, which surpassed the results of MMR in taking care of categorical data. It has the capability of handling heterogeneous data, but only to a limited extent because it is based on the classical rough set model. In this paper, we generalize the MMeR algorithm with neighbourhood relations and make it a neighbourhood rough set model, which we call MMeNR (Min Mean Neighborhood Roughness). It takes care of heterogeneous data and also the uncertainty associated with it. Standard data sets have been used to gauge its effectiveness over the other methods.
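To illustrate the neighbourhood relation that lets a rough set model handle heterogeneous data, the sketch below computes the neighbourhood of an object under a simple mixed numeric/categorical distance and a radius delta. The distance function and the toy data are assumptions for illustration; the paper's exact neighbourhood definition may differ.

def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute distance between two objects (dicts of attribute -> value):
    numeric attributes use a range-normalized difference, categorical ones a 0/1 mismatch."""
    total = 0.0
    for attr in a:
        if attr in numeric_ranges:                         # numeric attribute
            total += abs(a[attr] - b[attr]) / numeric_ranges[attr]
        else:                                              # categorical attribute
            total += 0.0 if a[attr] == b[attr] else 1.0
    return total / len(a)

def neighbourhood(idx, objects, numeric_ranges, delta):
    """Indices of all objects within distance delta of objects[idx]; this set replaces
    the equivalence class used by classical rough set models."""
    centre = objects[idx]
    return {j for j, other in enumerate(objects)
            if mixed_distance(centre, other, numeric_ranges) <= delta}

# invented heterogeneous objects: one numeric and one categorical attribute
data = [
    {"age": 25, "colour": "red"},
    {"age": 27, "colour": "red"},
    {"age": 60, "colour": "blue"},
]
ranges = {"age": 60 - 25}                 # value range of each numeric attribute
print(neighbourhood(0, data, ranges, delta=0.2))   # {0, 1}: the two similar objects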

Cited by
Journal ArticleDOI
TL;DR: Experimental results show that rough set-based clustering methods provided better efficiency than hard and fuzzy methods, and that methods based on the initialization of the centroids also provided good results.
Abstract: Clustering is a complex unsupervised method used to group most similar observations of a given dataset within the same cluster. To guarantee high efficiency, the clustering process should ensure hi...

14 citations

Journal ArticleDOI
TL;DR: The fuzzy set technique is used to introduce a similarity measure, termed the Kernel and Set Similarity Measure, to find the similarity of sequential data and generate overlapping clusters; this is the first fuzzy clustering algorithm for sequential data.
Abstract: With the increase in popularity of the Internet and the advancement of technology in fields like bioinformatics and other scientific communities, the amount of sequential data is increasing at a tremendous rate. With this increase, it has become inevitable to mine useful information from this vast amount of data. The mined information can be used in various spheres, from day-to-day web activities like the prediction of the next web pages and serving better advertisements, to biological areas like genomic data analysis. A rough set based clustering of sequential data was proposed by Kumar et al. recently. They defined and used a measure, called the Sequence and Set Similarity Measure, to determine similarity in data. However, we have observed that this measure does not reflect some important characteristics of sequential data. As a result, in this paper we use the fuzzy set technique to introduce a similarity measure, which we term the Kernel and Set Similarity Measure, to find the similarity of sequential data and generate overlapping clusters. For this purpose, we use exponential string kernels and Jaccard's similarity index. The new similarity measure takes into account the order of items in the sequence as well as the content of the sequential pattern. In order to compare our algorithm with that of Kumar et al., we used the MSNBC data set from the UCI repository, which was also used in their paper. As far as our knowledge goes, this is the first fuzzy clustering algorithm for sequential data.
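The abstract does not give the exact definition of the Kernel and Set Similarity Measure, but its two ingredients, an order-aware string-kernel term and Jaccard's index over the item sets, can be sketched as follows. The exponential position-decay kernel below is only an assumed stand-in used to make the combination concrete, not the paper's measure.

import math

def jaccard(s, t):
    """Content similarity: Jaccard index of the two item sets."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if a | b else 1.0

def decay_kernel(s, t, gamma=0.5):
    """Order-aware similarity: matching items score higher when their positions agree,
    with an exponential decay in the positional gap (a toy exponential kernel)."""
    if not s or not t:
        return 0.0
    score = sum(math.exp(-gamma * abs(i - j))
                for i, x in enumerate(s) for j, y in enumerate(t) if x == y)
    return score / (len(s) * len(t))

def combined_similarity(s, t, alpha=0.5):
    """Weighted mix of the order-aware kernel term and the set-based Jaccard term."""
    return alpha * decay_kernel(s, t) + (1 - alpha) * jaccard(s, t)

# invented web-navigation sessions (sequences of page categories)
print(combined_similarity(["news", "sport", "weather"], ["news", "weather", "sport"]))
print(combined_similarity(["news", "sport"], ["finance", "travel"]))   # disjoint content -> 0.0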

8 citations

Journal ArticleDOI
TL;DR: A recent review of text mining in the domain of mosquito-borne disease was not available to the best of the authors' knowledge; this review therefore presents a bibliometric analysis of the 294 scientific articles published in Scopus and PubMed from 2016 to 2021.

6 citations

Journal ArticleDOI
TL;DR: This study found that the classification model from the SVM algorithm provided the best result, with 86.45% accuracy in correctly classifying the 'Eligible' status of candidates, while RT was the weakest model, with the lowest accuracy rate for this purpose.
Abstract: A scholarship is a financial facility given to eligible students to pursue Higher Education. Limited funding sources and the growing number of applicants force the Government to find solutions that help speed up and facilitate the selection of eligible students, and to adopt a systematic approach for this purpose. In this study, a data mining approach was used to propose a classification model for determining scholarship award results. A dataset of successful and unsuccessful applicants was taken and processed as the training and testing data used in the modelling process. Five algorithms were employed to develop a classification model for determining the award of the scholarship, namely the J48, SVM, NB, ANN and RT algorithms. Each model was evaluated using technical evaluation metrics, such as contingency table metrics and accuracy, precision, and recall measures. As a result, the best models were classified into two different categories: the best model for classifying 'Eligible' status, and the best model for classifying 'Not Eligible' status. The knowledge obtained from the rule-based model was evaluated through knowledge analysis conducted by technical and domain experts. This study found that the classification model from the SVM algorithm provided the best result, with 86.45% accuracy in correctly classifying the 'Eligible' status of candidates, while RT was the weakest model, with the lowest accuracy rate for this purpose at only 82.9%. The model that had the highest accuracy rate for the 'Not Eligible' status of the scholarship offered was the NB model, whereas the SVM model was the weakest model for classifying 'Not Eligible' status. In addition, a knowledge analysis of the decision tree model was also made, and it was found that some new information derived from this research may help stakeholders in making new policies and scholarship programmes in the future.
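As a sketch of the experimental pattern described above, the code below trains scikit-learn counterparts of the five algorithms (a decision tree standing in for J48, SVC, Gaussian NB, an MLP standing in for the ANN, and a random forest standing in for RT) on synthetic data and reports accuracy, precision, and recall. The data and hyperparameters are invented; only the comparison pattern mirrors the study.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier            # stand-in for J48 (C4.5)
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier           # stand-in for the ANN
from sklearn.ensemble import RandomForestClassifier        # stand-in for RT
from sklearn.metrics import accuracy_score, precision_score, recall_score

# synthetic stand-in for the eligible (1) / not eligible (0) applicant records
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "J48-like tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "NB": GaussianNB(),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
    "RT-like forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          "acc=%.3f" % accuracy_score(y_test, y_pred),
          "prec=%.3f" % precision_score(y_test, y_pred),   # precision for the positive class
          "rec=%.3f" % recall_score(y_test, y_pred))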

5 citations

Book ChapterDOI
30 Sep 2020
TL;DR: This work presents a survey of the literature on big data solutions in the healthcare sector, covering the potential changes, the challenges, and the available platforms and methodologies for implementing big data analytics in healthcare.
Abstract: A tremendous amount of data, growing at an exponential rate, is produced by the present healthcare sector. Information from differing sources such as electronic health records, clinical data, streaming data from sensors, biomedical image data, biomedical signal data, laboratory data, and so on makes it large as well as complex in terms of varying data formats, which has strained the abilities of prevailing conventional database systems in terms of scalability, storage of unstructured data, concurrency, and cost. Big data solutions step into the picture by harnessing these colossal, assorted, and multipart data sets to uncover more meaningful and informed patterns. The reconciliation of multimodal data, aimed at extracting the relationships among the unstructured data types, is a hot topic these days. Big data helps in drawing insights from these immense expanses of information, and the term is used for approaches that address the issues of volume, velocity, and variety commonly present in healthcare data. This work presents a survey of the literature on big data solutions in the healthcare sector, the potential changes, the challenges, and the available platforms and methodologies for implementing big data analytics in healthcare. The work categorizes big healthcare data (BHD) applications into five broad categories, followed by a detailed review of each sphere, and also offers some practical, available real-life applications of BHD solutions.

3 citations