Author

Durga Toshniwal

Other affiliations: Indian Institutes of Technology
Bio: Durga Toshniwal is an academic researcher from Indian Institute of Technology Roorkee. The author has contributed to research in topics: Cluster analysis & Naive Bayes classifier. The author has an h-index of 17 and has co-authored 86 publications receiving 1209 citations. Previous affiliations of Durga Toshniwal include Indian Institutes of Technology.


Papers
Journal ArticleDOI
TL;DR: This study proposes a Hybrid Prediction Model (HPM) that uses the Simple K-means clustering algorithm to validate the chosen class label of the given data and subsequently applies a classification algorithm to the result set.
Abstract: A wide range of computational methods and tools for data analysis are available. In this study we took advantage of these technological advancements to develop models for predicting Type-2 diabetes in patients. We aim to investigate how diabetes incidence is affected by patients' characteristics and measurements. Efficient predictive modeling is required for medical researchers and practitioners. This study proposes a Hybrid Prediction Model (HPM) that uses the Simple K-means clustering algorithm to validate the chosen class label of the given data (incorrectly classified instances are removed, i.e., a pattern is extracted from the original data) and subsequently applies a classification algorithm to the result set. The C4.5 algorithm is used to build the final classifier model using the k-fold cross-validation method. The Pima Indians diabetes data was obtained from the University of California at Irvine (UCI) machine learning repository. A wide range of classification methods have been applied previously by various researchers in order to find the best performing algorithm on this dataset. The accuracies achieved have been in the range of 59.4-84.05%. However, the proposed HPM obtained a classification accuracy of 92.38%. To evaluate the performance of the proposed method, the sensitivity and specificity measures commonly used in medical classification studies were used.

160 citations
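
The cluster-then-filter step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny k-means routine, its seeding from the first k points, the majority-vote mapping from clusters to labels, and the toy data are all assumptions made for illustration. The subsequent C4.5 classification stage is not shown.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; seeding from the first k points is an assumption."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def filter_by_cluster(points, labels, k=2):
    """Keep only instances whose label matches the majority label of their cluster."""
    assign = kmeans(points, k)
    majority = {}
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    return [(p, l) for p, l, a in zip(points, labels, assign) if majority[a] == l]

# Hypothetical toy data: the fifth point sits in the "neg" cluster but is labelled "pos".
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels = ["neg", "pos", "neg", "pos", "pos", "pos"]
kept = filter_by_cluster(points, labels)
```

In this toy run the instance whose label disagrees with its cluster is dropped, and the remaining five instances would be passed on to the classifier.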

Journal ArticleDOI
TL;DR: This paper proposes a framework that uses the K-modes clustering technique as a preliminary step to segment 11,574 road accidents on the road network of Dehradun (India) between 2009 and 2014, and finds that combining K-modes clustering with association rule mining reveals information that would otherwise remain hidden.
Abstract: One of the key objectives in accident data analysis is to identify the main factors associated with road and traffic accidents. However, the heterogeneous nature of road accident data makes the analysis task difficult. Data segmentation has been used widely to overcome this heterogeneity of the accident data. In this paper, we proposed a framework that used the K-modes clustering technique as a preliminary task for segmentation of 11,574 road accidents on the road network of Dehradun (India) between 2009 and 2014 (both included). Next, association rule mining is used to identify the various circumstances that are associated with the occurrence of an accident, for both the entire data set (EDS) and the clusters identified by the K-modes clustering algorithm. The findings of the cluster-based analysis and the entire data set analysis are then compared. The results reveal that the combination of K-modes clustering and association rule mining is very inspiring, as it produces important information that would remain hidden if no segmentation had been performed prior to generating association rules. Further, a trend analysis has also been performed for each cluster and for EDS accidents, which finds different trends in different clusters, whereas a positive trend is shown by the EDS. Trend analysis also shows that prior segmentation of accident data is very important before analysis.

118 citations
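
As a rough illustration of the association rule step mentioned in the abstract above, the following sketch computes the support and confidence of a single rule within one cluster. The attribute values and the four toy records are hypothetical and are not taken from the Dehradun data set; the sketch evaluates one rule and does not implement a full rule mining algorithm.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical accident records belonging to one cluster.
cluster = [
    {"highway", "intersection", "two_wheeler"},
    {"highway", "intersection", "pedestrian"},
    {"local_road", "two_wheeler"},
    {"highway", "intersection", "two_wheeler"},
]

rule_support = support(cluster, {"highway", "intersection", "two_wheeler"})
rule_confidence = confidence(cluster, {"highway", "intersection"}, {"two_wheeler"})
```

Running the same two functions on the entire data set and on each cluster separately is one way to see how a rule that is weak overall can be strong within a segment.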

Journal ArticleDOI
TL;DR: This paper applied k-means algorithm to group the accident locations into three categories, high-frequency, moderate-frequency and low-frequency accident locations, and used association rule mining to characterize these locations.
Abstract: Data mining has been proven to be a reliable technique to analyze road accidents and provide productive results. Most road accident data analyses use data mining techniques, focusing on identifying factors that affect the severity of an accident. However, any damage resulting from road accidents is always unacceptable in terms of health, property damage and other economic factors. Sometimes, it is found that road accident occurrences are more frequent at certain specific locations. The analysis of these locations can help in identifying certain road accident features that make road accidents occur frequently in these locations. Association rule mining is one of the popular data mining techniques that identify the correlation among various attributes of road accidents. In this paper, we first applied the k-means algorithm to group the accident locations into three categories: high-frequency, moderate-frequency and low-frequency accident locations. The k-means algorithm takes the accident frequency count as a parameter to cluster the locations. Then we used association rule mining to characterize these locations. The rules revealed different factors associated with road accidents at different locations with varying accident frequencies. The association rules for high-frequency accident locations disclosed that intersections on highways are more dangerous for every type of accident. High-frequency accident locations mostly involved two-wheeler accidents in hilly regions. In moderate-frequency accident locations, colonies near local roads and intersections on highway roads were found to be dangerous for pedestrian hit accidents. Low-frequency accident locations are scattered throughout the district, and most of the accidents at these locations were not critical. Although the data set was limited to some selected attributes, our approach extracted some useful hidden information from the data which can be utilized to take preventive measures in these locations.

117 citations
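
The grouping of locations by accident frequency described in the abstract above can be sketched as a one-dimensional k-means with three clusters. This is a hedged illustration only: the centroid seeding (minimum, midpoint, maximum), the function name, the location names and the counts are all assumptions, not details from the paper.

```python
def categorize_locations(freq, iters=20):
    """1-D k-means with k=3 over accident counts; returns {location: category}."""
    counts = list(freq.values())
    lo, hi = min(counts), max(counts)
    centroids = [lo, (lo + hi) / 2, hi]  # seeding choice is an assumption

    def nearest(count):
        # Index of the centroid closest to this count.
        return min(range(3), key=lambda i: abs(count - centroids[i]))

    for _ in range(iters):
        groups = [[], [], []]
        for c in counts:
            groups[nearest(c)].append(c)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]

    names = ["low-frequency", "moderate-frequency", "high-frequency"]
    return {loc: names[nearest(c)] for loc, c in freq.items()}

# Hypothetical accident counts per location.
accident_counts = {"A": 2, "B": 3, "C": 15, "D": 17, "E": 40, "F": 44}
categories = categorize_locations(accident_counts)
```

Each resulting category could then be characterized separately with association rules, as the abstract describes.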

Proceedings ArticleDOI
09 Feb 2010
TL;DR: A new approach to generating association rules on numeric data and a modified equal-width binning approach to discretizing continuous-valued attributes are introduced to help doctors explore their data and better understand the discovered rules.
Abstract: The discovery of knowledge from medical databases is important in order to make effective medical diagnoses. The aim of data mining is to extract information from a database and generate clear and understandable descriptions of patterns. In this study we have introduced a new approach to generate association rules on numeric data. We propose a modified equal-width binning interval approach to discretizing continuous-valued attributes. The approximate width of the desired intervals is chosen based on the opinion of a medical expert and is provided as an input parameter to the model. First, we converted the numeric attributes into categorical form using the above technique. The Apriori algorithm, which is usually used for market basket analysis, was used to generate rules on the Pima Indian diabetes data. The data set was taken from the UCI machine learning repository and contains 768 instances and 8 numeric attributes. We discover that the often neglected pre-processing steps in knowledge discovery are the most critical elements in determining the success of a data mining application. Lastly, we generated association rules that are useful for identifying general associations in the data and for understanding the relationship between the measured fields and whether the patient goes on to develop diabetes or not. We present a step-by-step approach to help doctors explore their data and better understand the discovered rules.

81 citations
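
The discretization step in the abstract above takes an interval width as an input parameter. The sketch below shows plain equal-width binning with a user-supplied width; the paper's specific modification is not described in the abstract and is not reproduced here. The function name, the `start` parameter and the sample values are hypothetical.

```python
def equal_width_bin(value, start, width):
    """Return the half-open interval [low, high) of the given width that contains value."""
    index = (value - start) // width  # floor division picks the bin index
    low = start + index * width
    return f"[{low}, {low + width})"

# Hypothetical plasma glucose readings and an expert-chosen width of 50.
glucose = [85, 99, 100, 148, 183]
binned = [equal_width_bin(v, start=50, width=50) for v in glucose]
```

The resulting interval labels are categorical, so they can be treated as items when generating association rules.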

Proceedings ArticleDOI
01 Dec 2015
TL;DR: Data mining techniques are used to analyze the data provided by EMRI: the accident data is first clustered, and association rule mining is then applied to identify the circumstances in which an accident may occur for each cluster.
Abstract: Road accidents are one of the crucial areas of research in India. A variety of research has been done on data collected through police records covering a limited portion of highways. The analysis of such data can only reveal information regarding that portion; but accidents are scattered not only on highways but also on local roads. A different source of road accident data in India is the Emergency Management Research Institute (EMRI), which serves and keeps track of every accident record on every type of road and covers information on the entire State's road accidents. In this paper, we have used data mining techniques to analyze the data provided by EMRI: we first cluster the accident data, and an association rule mining technique is then applied to identify the circumstances in which an accident may occur for each cluster. The results can be utilized to direct accident prevention efforts to the areas identified for different categories of accidents in order to reduce the number of accidents.

54 citations


Cited by
01 Jan 2002

9,314 citations

Journal ArticleDOI
TL;DR: An in-depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of imbalanced learning are provided.
Abstract: 527 articles related to imbalanced data and rare events are reviewed. Viewing reviewed papers from both technical and practical perspectives. Summarizing existing methods and corresponding statistics by a new taxonomy idea. Categorizing 162 application papers into 13 domains and giving introduction. Some open questions are discussed at the end of this manuscript. Rare events, especially those that could potentially negatively impact society, often require human decision-making responses. Detecting rare events can be viewed as a prediction task in the data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in-depth review of rare event detection from an imbalanced learning perspective. Five hundred and seventeen related papers that have been published in the past decade were collected for the study. The initial statistics suggested that rare event detection and imbalanced learning are of concern across a wide range of research areas, from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then we detail the applications for each category. Finally, some suggestions from the reviewed papers are incorporated with our experiences and judgments to offer further research directions for the imbalanced learning and rare event detection fields.

1,448 citations

01 Jan 2009
TL;DR: In this paper, the authors assess 10 start-of-spring (SOS) methods for North America between 1982 and 2006 and find that SOS estimates were more related to the first leaf and first flowers expanding phenological stages.
Abstract: Shifts in the timing of spring phenology are a central feature of global change research. Long-term observations of plant phenology have been used to track vegetation responses to climate variability but are often limited to particular species and locations and may not represent synoptic patterns. Satellite remote sensing is instead used for continental to global monitoring. Although numerous methods exist to extract phenological timing, in particular start-of-spring (SOS), from time series of reflectance data, a comprehensive intercomparison and interpretation of SOS methods has not been conducted. Here, we assess 10 SOS methods for North America between 1982 and 2006. The techniques include consistent inputs from the 8 km Global Inventory Modeling and Mapping Studies Advanced Very High Resolution Radiometer NDVIg dataset, independent data for snow cover, soil thaw, lake ice dynamics, spring streamflow timing, over 16000 individual measurements of ground-based phenology, and two temperature-driven models of spring phenology. Compared with an ensemble of the 10 SOS methods, we found that individual methods differed in average day-of-year estimates by ±60 days and in standard deviation by ±20 days. The ability of the satellite methods to retrieve SOS estimates was highest in northern latitudes and lowest in arid, tropical, and Mediterranean ecoregions. The ordinal rank of SOS methods varied geographically, as did the relationships between SOS estimates and the cryospheric/hydrologic metrics. Compared with ground observations, SOS estimates were more related to the first leaf and first flowers expanding phenological stages. We found no evidence for time trends in spring arrival from ground- or model-based data; using an ensemble estimate from two methods that were more closely related to ground observations than other methods, SOS

828 citations

Journal ArticleDOI
TL;DR: This article assesses the different machine learning methods that deal with the challenges in IoT data by considering smart cities as the main use case and presents a taxonomy of machine learning algorithms explaining how different techniques are applied to the data in order to extract higher level information.

690 citations

Journal ArticleDOI
TL;DR: It is found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies); however, the Random Forest algorithm showed comparatively superior accuracy.
Abstract: Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently emerged as a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction. In this study, extensive research efforts were made to identify those studies that applied more than one supervised machine learning algorithm to the prediction of a single disease. Two databases (i.e., Scopus and PubMed) were searched for different types of search items. Thus, we selected 48 articles in total for the comparison among variants of supervised machine learning algorithms for disease prediction. We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naive Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy. Of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which topped 41% of the studies in which it was considered. This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This information on relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.

580 citations