scispace - formally typeset
Author

Sajal Raj Joshi

Bio: Sajal Raj Joshi is an academic researcher from Siddaganga Institute of Technology. The author has contributed to research on the topics of Feature Selection and Ranking, has an h-index of 1, and has co-authored 1 publication receiving 12 citations.

Papers
Journal ArticleDOI
TL;DR: This work proposes a new feature selection mechanism, an amalgamation of the filter and the wrapper techniques that takes into consideration the benefits of both methods. It is based on a two-phase process in which the features are first ranked and the best subset of features is then chosen based on that ranking.
Abstract: Feature Selection has been a significant preprocessing procedure for classification in the area of Supervised Machine Learning. It is mostly applied when the attribute set is very large, since a large set of attributes often tends to mislead the classifier. Extensive research has been performed to increase the efficacy of the predictor by finding the optimal set of features. The feature subset should be such that it enhances classification accuracy through the removal of redundant features. We propose a new feature selection mechanism, an amalgamation of the filter and the wrapper techniques that takes into consideration the benefits of both methods. Our hybrid model is based on a two-phase process in which we rank the features and then choose the best subset of features based on that ranking. We validated our model on various datasets, using multiple evaluation metrics. Furthermore, we compared and analyzed our results against previous works. The proposed model outperformed many existing algorithms and gave good results.
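The two-phase rank-then-select scheme the abstract describes can be sketched as follows. This is a hypothetical illustration, not the authors' exact algorithm: it assumes a Pearson-correlation filter for the ranking phase and a toy leave-one-out nearest-centroid classifier as the wrapper.

```python
def filter_rank(X, y):
    # Phase 1 (filter): rank features by absolute Pearson correlation with the label.
    n, d = len(X), len(X[0])
    def corr(j):
        xs = [row[j] for row in X]
        mx, my = sum(xs) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, y))
        vx = sum((a - mx) ** 2 for a in xs) ** 0.5
        vy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(cov / (vx * vy)) if vx and vy else 0.0
    return sorted(range(d), key=corr, reverse=True)

def nearest_centroid_accuracy(X, y, feats):
    # Toy wrapper: leave-one-out accuracy of a nearest-centroid classifier
    # restricted to the candidate feature subset `feats`.
    correct = 0
    for i in range(len(X)):
        groups = {}
        for k, (row, lab) in enumerate(zip(X, y)):
            if k != i:
                groups.setdefault(lab, []).append([row[j] for j in feats])
        means = {lab: [sum(col) / len(col) for col in zip(*rows)]
                 for lab, rows in groups.items()}
        probe = [X[i][j] for j in feats]
        pred = min(means, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(probe, means[lab])))
        correct += (pred == y[i])
    return correct / len(X)

def hybrid_select(X, y):
    # Phase 2 (wrapper): walk the ranking and keep a feature only if it
    # strictly improves the wrapper accuracy.
    selected, best = [], 0.0
    for j in filter_rank(X, y):
        acc = nearest_centroid_accuracy(X, y, selected + [j])
        if acc > best:
            selected, best = selected + [j], acc
    return selected, best
```

On a toy dataset where feature 0 separates the classes and feature 1 is noise, only feature 0 survives the second phase.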

23 citations


Cited by
Journal ArticleDOI
18 Aug 2021 - Sensors
TL;DR: In this paper, a tri-stage wrapper-filter-based feature selection framework is proposed for medical report-based disease detection. An ensemble is formed by four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance), and each feature from the union set is assessed by three classification algorithms.
Abstract: In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When we apply raw data directly for classification or clustering purposes, we sometimes observe that the learning algorithms do not perform well. One possible reason for this is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Hence, feature selection methods are used to identify the subset of relevant features that can maximize model performance. Moreover, due to the reduction in feature dimension, both the training time and the storage required by the model can be reduced as well. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for the purpose of medical report-based disease detection. In the first stage, an ensemble was formed by four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance), and then each feature from the union set was assessed by three classification algorithms (support vector machine, naive Bayes, and k-nearest neighbors) and an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In these two stages, the XGBoost classification algorithm was applied to obtain the most contributing features that, in turn, provide the best optimal subset. Then, in the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, in order to further reduce the feature set and to achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets taken from the UCI machine learning repository, namely arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can perform better than many state-of-the-art methods and can detect important features as well. Fewer features mean fewer medical tests for a correct diagnosis, thus saving both time and cost.
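The second stage described above, discarding highly correlated features via Pearson correlation, can be sketched in a few lines. This is a generic illustration of that stage (the threshold value is an assumption, not taken from the paper): a feature is kept only if its absolute correlation with every already-kept feature stays below the threshold.

```python
def pearson(a, b):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def drop_correlated(X, threshold=0.9):
    # Scan features in order; keep feature j only if it is not highly
    # correlated (|r| >= threshold) with any feature already kept.
    cols = list(zip(*X))  # column view of the row-major dataset
    kept = []
    for j in range(len(cols)):
        if all(abs(pearson(cols[j], cols[k])) < threshold for k in kept):
            kept.append(j)
    return kept
```

For example, if column 1 is an exact multiple of column 0, it is discarded while a weakly correlated column 2 survives.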

49 citations

Journal ArticleDOI
TL;DR: The proposed abnormality detection method gives good results and outperforms typical feature strategies in an effective and generalizable way.
Abstract: The detection of abnormal electricity consumption behavior has been of great importance in recent years. However, existing research often focuses on algorithm improvement and ignores the process of obtaining features. The optimal feature set, which reflects customers' electricity consumption behavior, has a significant influence on the final detection results. Moreover, it is not straightforward to obtain datasets with label information. In this paper, a method based on feature engineering for unsupervised detection of abnormal electricity consumption behavior is proposed. First, the original feature set is constructed by brainstorming in the feature engineering step. Then, the optimal feature set is obtained by selecting features based on their variance and the similarity between them. After that, in the abnormality detection step, a density-based clustering algorithm, combined with unsupervised clustering evaluation indexes, is used to detect abnormal electricity consumption behaviors; the best clustering parameters are selected through iteration and evaluation. Finally, using the load dataset of an industrial park, several typical feature strategies are compared against the feature engineering proposed in this paper. To perform the evaluation, the label information of abnormal behaviors is obtained by combining the original detection results with injected abnormal data. The proposed abnormality detection method gives good results and outperforms typical feature strategies in an effective and generalizable way.
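The density-based clustering step can be illustrated with a minimal DBSCAN-style sketch, where points that never fall inside a dense neighborhood are labelled noise and treated as abnormal behavior. This is a generic illustration, not the paper's implementation; the `eps` and `min_pts` values are fixed here for clarity, whereas the paper selects them by iterating and scoring clusterings with unsupervised evaluation indexes.

```python
def dbscan(points, eps, min_pts):
    # Minimal DBSCAN: labels[i] is a cluster id, or -1 for noise/anomalies.
    n = len(points)
    labels = [None] * n  # None = unvisited, -1 = noise, else cluster id
    def region(i):
        return [j for j in range(n)
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        labels[i] = cid
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid  # noise reachable from a core point -> border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = region(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point: expand the cluster
        cid += 1
    return labels
```

In an anomaly-detection setting, the consumption profiles labelled -1 are the candidates for abnormal behavior.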

33 citations

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The objective of the proposed work is to model multi-time-scale time series data with Auto-Regression (AR) and Moving Average (MA) models, relying only on time and the label rather than many attributes, and to model each time scale separately.
Abstract: Click fraud refers to the practice of generating random clicks on a link in order to extract illegitimate revenue from advertisers. We present a generalized model for temporal click fraud data in the form of probability-based or learning-based anomaly detection and time series modeling at time scales such as minutes and hours. The proposed approach consists of seven stages: pre-processing, data smoothing, fraudulent pattern identification, homogenizing variance, normalizing auto-correlation, developing the AR and MA models, and fine-tuning along with evaluation of the models. The objective of the proposed work is, first, to model multi-time-scale time series data with AR/MA models by relying only on time and the label, without the need for many attributes, and, second, to model different time scales separately with Auto-Regression (AR) and Moving Average (MA) models. We then evaluate the models by tuning forecasting errors and by minimizing the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to obtain a best-fit model for all time-scale data. Through our experiments we also demonstrate that the probability-based model performs better than the learning-based probabilistic estimator model.
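Order selection by minimizing an information criterion, as used in the final stage above, can be sketched for the AR side of the model. This is a simplified, self-contained illustration (least-squares AR fit without an intercept, AIC with a small floor on the residual variance), not the paper's pipeline; the BIC variant would replace the 2*p penalty with p*log(m).

```python
import math

def gauss_solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-9:
            return None  # (near-)singular normal equations
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ar(y, p):
    # Least-squares fit of y[t] = phi_1*y[t-1] + ... + phi_p*y[t-p].
    X = [[y[t - k] for k in range(1, p + 1)] for t in range(p, len(y))]
    tgt = y[p:]
    ATA = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    ATb = [sum(r[a] * t for r, t in zip(X, tgt)) for a in range(p)]
    phi = gauss_solve(ATA, ATb)
    if phi is None:
        return None, float("inf")
    rss = sum((t - sum(phi[k] * r[k] for k in range(p))) ** 2
              for r, t in zip(X, tgt))
    return phi, rss

def select_order(y, max_p=4):
    # Pick the AR order with minimal AIC; for BIC use m*log(...) + p*log(m).
    best = (None, None, float("inf"))
    for p in range(1, max_p + 1):
        phi, rss = fit_ar(y, p)
        if phi is None:
            continue  # skip orders whose normal equations are singular
        m = len(y) - p
        aic = m * math.log(rss / m + 1e-12) + 2 * p
        if aic < best[2]:
            best = (p, phi, aic)
    return best
```

On a series generated by an exact AR(2) recurrence, the AIC sweep recovers order 2 and the true coefficients.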

16 citations

Journal ArticleDOI
TL;DR: In this paper, two hybrid feature selection methods, Metric Ranked Feature Inclusion (MRIN) and Accuracy Ranked Feature Inclusion (ARIN), are proposed to reduce the computational overhead and improve the correctness of the classifier.
Abstract: Feature selection has emerged as a craft by which we boost the performance of a learning model. Feature or attribute selection is a data preprocessing technique in which only the most informative features are retained and given to the predictor. This reduces the computational overhead and improves the correctness of the classifier. Attribute selection is commonly carried out by applying a filter or by using the performance of the learning model to gauge the quality of the attribute subset. Metric Ranked Feature Inclusion and Accuracy Ranked Feature Inclusion are the two novel hybrid feature selection methods we introduce in this paper. These algorithms follow a two-stage procedure: feature ranking followed by feature subset selection. They differ in how they rank the features but share the same subset selection technique. Multiple experiments have been conducted to assess our models. We compare our results with numerous works of the past and validate our models on 12 datasets. From the results, we infer that our algorithms perform better than many existing models.

8 citations

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a feature selection algorithm, RIFS2D, that integrates multiple incremental feature selection (IFS) blocks; the features it selects achieved better prediction accuracy and were targeted by more drugs than the t-test top-ranked features.

8 citations