Journal ArticleDOI

Selection of Important Features for Optimizing Crop Yield Prediction

TL;DR: Five Feature Selection (FS) algorithms are discussed: Sequential Forward FS, Sequential Backward Elimination FS, Correlation-based FS, Random Forest Variable Importance, and the Variance Inflation Factor algorithm.
Abstract: In agriculture, crop yield prediction is critical. Crop yield depends on various features, including geographic, climatic and biological ones. This research article discusses five Feature Selection (FS) algorithms: Sequential Forward FS, Sequential Backward Elimination FS, Correlation-based FS, Random Forest Variable Importance and the Variance Inflation Factor algorithm. Data used for the analysis were drawn from secondary sources of the Tamil Nadu state Agriculture Department for a period of 30 years; 75% of the data was used for training and 25% for testing. The performance of the feature selection algorithms is evaluated with Multiple Linear Regression (MLR), and the RMSE, MAE, R and RRMSE metrics are calculated for each algorithm. The adjusted R2 was used to find the optimum feature subset, and the time complexity of each algorithm was also considered. The selected features are applied to MLR, an Artificial Neural Network and M5Prime. MLR achieves 85% accuracy using the features selected by the Sequential Forward FS (SFFS) algorithm.
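The selection criterion the abstract describes lends itself to a short illustration. Below is a minimal sketch of sequential forward selection scored by adjusted R2 with a Multiple Linear Regression model and a 75/25 split, mirroring the setup above; the synthetic data, the stopping rule and the greedy loop are illustrative assumptions, not the authors' implementation.

```python
# Sequential forward feature selection driven by adjusted R^2 (a sketch,
# assuming synthetic regression data in place of the agricultural data set).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def adjusted_r2(r2, n, k):
    # Penalise R^2 by the number of predictors k for sample size n.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

selected, remaining = [], list(range(X.shape[1]))
best_adj = -np.inf
while remaining:
    scores = []
    for f in remaining:
        cols = selected + [f]
        model = LinearRegression().fit(X_tr[:, cols], y_tr)
        r2 = r2_score(y_te, model.predict(X_te[:, cols]))
        scores.append((adjusted_r2(r2, len(y_te), len(cols)), f))
    adj, f = max(scores)
    if adj <= best_adj:          # stop when no candidate improves adjusted R^2
        break
    best_adj = adj
    remaining.remove(f)
    selected.append(f)

model = LinearRegression().fit(X_tr[:, selected], y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te[:, selected])) ** 0.5
print(f"selected={selected}  adjusted R^2={best_adj:.3f}  RMSE={rmse:.2f}")
```

Sequential backward elimination is the mirror image: start from the full feature set and repeatedly drop the feature whose removal most improves (or least harms) adjusted R2.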
Citations
Journal ArticleDOI
TL;DR: Climate change has begun to affect crop yields badly; farmers are unable to choose the best crop for their needs, as they lack the expertise to adapt to the changing climate.
Abstract: Earlier, crop cultivation was undertaken on the basis of farmers’ hands-on expertise. However, climate change has begun to affect crop yields badly. Consequently, farmers are unable to choose the right crop […]

45 citations

Journal ArticleDOI
TL;DR: This work proposes a novel FS approach called modified recursive feature elimination (MRFE), which selects appropriate features from a data set for crop prediction and ranks them using a ranking method.
Abstract: Crop cultivation prediction is an integral part of agriculture and is primarily based on factors such as soil, environmental features like rainfall and temperature, and the quantum of fertilizer used, particularly nitrogen and phosphorus. These factors, however, vary from region to region; consequently, farmers are unable to cultivate similar crops in every region. This is where machine learning (ML) techniques step in to help find the most suitable crops for a particular region, thus assisting farmers a great deal in crop prediction. The feature selection (FS) facet of ML is a major component in the selection of key features for a particular region and keeps the crop prediction process constantly upgraded. This work proposes a novel FS approach called modified recursive feature elimination (MRFE) to select appropriate features from a data set for crop prediction. The proposed MRFE technique selects and ranks salient features using a ranking method. The experimental results show that the MRFE method selects the most accurate features, while the bagging technique helps accurately predict a suitable crop. The performance of the proposed MRFE technique is evaluated by various metrics such as accuracy (ACC), precision, recall, specificity, F1 score, area under the curve, mean absolute error, and log loss. The performance analysis shows that the MRFE technique, with 95% ACC, outperforms other FS methods.
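As a rough illustration of the recipe the abstract describes (recursive elimination to rank and select features, then a bagging ensemble for prediction), here is a sketch using plain scikit-learn RFE; it is not the authors' MRFE variant, whose specific modification is not detailed here, and the data set is synthetic.

```python
# RFE-based ranking plus a bagging classifier (a sketch, not the MRFE method).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Rank features by recursive elimination; ranking_ is 1 for kept features.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=6)
rfe.fit(X_tr, y_tr)
print("feature ranking:", rfe.ranking_)

# Train the bagging ensemble on the selected subset only.
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(rfe.transform(X_tr), y_tr)
acc = accuracy_score(y_te, bag.predict(rfe.transform(X_te)))
print(f"bagging accuracy on selected features: {acc:.3f}")
```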

21 citations

Journal ArticleDOI
TL;DR: Tests revealed that the developed hybrid feature selection and extraction technique delivered significant improvements in Rsq2, RtMSE, and mean absolute error (MAE) compared with FS and FeExt methods such as Correlation Analysis (CA), Singular Valued Decomposition (SiVD), Genetic Algorithm (GA), and wgt-PCA on benchmark and real-world farming datasets.
Abstract: Data pre-processing is a technique that transforms raw data into a useful format for applying machine learning (ML) techniques. Feature selection (FS) and feature extraction (FeExt) form significant components of data pre-processing. FS is the identification of relevant features that enhance the accuracy of a model. Since agricultural data contain diverse features related to climate, soil and fertilizer, FS attains significant importance, as irrelevant features may adversely impact the prediction of the model built. Likewise, FeExt involves the derivation of new attributes from the prevailing attributes; these new features retain all the information of the original attributes without the duplication. Keeping these points in mind, this work proposes a hybrid feature selection and feature extraction strategy for selecting features from agricultural data sets. A modified Genetic Algorithm (m-GA) was developed by designing a fitness function based on Mutual Information (MutInf) and Root Mean Square Error (RtMSE) to choose the features that most affect the target attribute (crop yield in this case). These selected features were then subjected to feature extraction using weighted principal component analysis (wgt-PCA). The extracted features were then fed into different ML models, viz. Regression (Reg), Artificial Neural Networks (ArtNN), Adaptive Neuro Fuzzy Inference System (ANFIS), Ensemble of Trees (EnT), and Support Vector Regression (SuVR). Trials on 3 benchmark and 8 real-world farming datasets revealed that the developed hybrid technique delivered significant improvements in Rsq2, RtMSE, and mean absolute error (MAE) in comparison to FS and FeExt methods such as Correlation Analysis (CA), Singular Valued Decomposition (SiVD), Genetic Algorithm (GA), and wgt-PCA.
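The two-stage strategy can be sketched as follows: a fitness function mixing Mutual Information and hold-out RMSE scores candidate feature masks (standing in for the m-GA search, which is reduced here to random masks), and the surviving features are weighted and projected with PCA. The equal weighting, the MI-based PCA weights and the data are assumptions, not the authors' settings.

```python
# MI+RMSE fitness over feature masks, then weighted PCA (illustrative sketch).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=12, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

def fitness(mask):
    # Higher MI with the target is better; lower hold-out RMSE is better.
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return -np.inf
    mi = mutual_info_regression(X_tr[:, cols], y_tr, random_state=1).mean()
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te[:, cols])) ** 0.5
    return 0.5 * mi - 0.5 * rmse / y_tr.std()   # assumed equal weighting

rng = np.random.default_rng(1)
masks = rng.integers(0, 2, size=(30, X.shape[1]))       # stand-in "population"
best = masks[np.argmax([fitness(m) for m in masks])]

# Stage 2: weight the selected features (here by their MI score) before PCA.
cols = np.flatnonzero(best)
w = mutual_info_regression(X_tr[:, cols], y_tr, random_state=1)
Z = PCA(n_components=min(3, cols.size)).fit_transform(X_tr[:, cols] * w)
print("selected features:", cols, " extracted shape:", Z.shape)
```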

11 citations

Journal ArticleDOI
01 Sep 2022-Sensors
TL;DR: A framework is proposed that uses non-feature reduction (All-F) as a baseline to investigate the performance of FS, FX, and a combination of both (FSX); the results reveal that FSX takes full advantage of both FS and FX, with FSX-based models performing best in 18 out of 21 cases.
Abstract: Machine learning (ML) has been widely used worldwide to develop crop yield forecasting models. However, it is still challenging to identify the most critical features from a dataset. Although either feature selection (FS) or feature extraction (FX) techniques have been employed, no research compares their performances and, more importantly, the benefits of combining both methods. Therefore, this paper proposes a framework that uses non-feature reduction (All-F) as a baseline to investigate the performance of FS, FX, and a combination of both (FSX). The case study employs the vegetation condition index (VCI) and temperature condition index (TCI) to develop 21 rice yield forecasting models for eight sub-regions in Vietnam based on ML methods, namely linear, support vector machine (SVM), decision tree (Tree), artificial neural network (ANN), and Ensemble. The results reveal that FSX takes full advantage of both FS and FX: FSX-based models perform best in 18 out of 21 cases, while FS-based models perform best in 2 and FX-based models in 1. The FSX-, FS-, and FX-based models improve on All-F-based models by 21% on average, and by up to 60%, in terms of RMSE. Furthermore, the 21 best models are developed based on Ensemble (13 models), Tree (6 models), linear (1 model), and ANN (1 model). These findings highlight the significant role of FS, FX, and especially FSX, coupled with a wide range of ML algorithms (especially Ensemble), in enhancing the accuracy of crop yield prediction.
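The paper's four configurations are easy to mimic with pipelines. The sketch below compares All-F, FS, FX and FSX on a single regressor, using SelectKBest and PCA as stand-ins for the paper's methods; synthetic features replace the VCI/TCI inputs.

```python
# All-F vs FS vs FX vs FSX as sklearn pipelines (illustrative stand-ins).
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=20, n_informative=8,
                       noise=20.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

pipelines = {
    "All-F": make_pipeline(SVR()),
    "FS":    make_pipeline(SelectKBest(f_regression, k=8), SVR()),
    "FX":    make_pipeline(PCA(n_components=8), SVR()),
    "FSX":   make_pipeline(SelectKBest(f_regression, k=12),
                           PCA(n_components=8), SVR()),
}
for name, pipe in pipelines.items():
    pipe.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, pipe.predict(X_te)) ** 0.5
    print(f"{name:5s} RMSE = {rmse:.2f}")
```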

4 citations

References
Proceedings Article
21 Aug 2003
TL;DR: A novel concept, predominant correlation, is introduced, and a fast filter method is proposed which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis.
Abstract: Feature selection, as a preprocessing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method are demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.
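The predominant-correlation idea behind such a fast filter can be sketched as: rank features by symmetrical uncertainty (SU) with the class, then drop any feature more strongly correlated with an already-kept feature than with the class. The relevance threshold and the toy data below are assumptions.

```python
# FCBF-style filter on symmetrical uncertainty (a sketch, assumed thresholds).
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def su(a, b):
    # Symmetrical uncertainty: 2*I(a;b) / (H(a) + H(b)), in [0, 1].
    h = entropy(a) + entropy(b)
    return 0.0 if h == 0 else 2 * mutual_info_score(a, b) / np.log(2) / h

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
f0 = y ^ (rng.random(500) < 0.10)            # relevant
f1 = f0 ^ (rng.random(500) < 0.05)           # redundant noisy copy of f0
f2 = rng.integers(0, 2, 500)                 # irrelevant
X = np.column_stack([f0, f1, f2])

relevance = [su(X[:, j], y) for j in range(X.shape[1])]
order = np.argsort(relevance)[::-1]
kept = []
for j in order:
    if relevance[j] < 0.05:                  # relevance threshold (assumed)
        continue
    # Predominance test: no kept feature correlates with j more than y does.
    if all(su(X[:, j], X[:, k]) < relevance[j] for k in kept):
        kept.append(j)
print("kept features:", kept, " SU with class:", np.round(relevance, 3))
```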

2,251 citations

Proceedings Article
03 Jul 1996
TL;DR: An efficient algorithm for feature selection is given that computes an approximation to the optimal feature selection criterion; empirical results show that the algorithm effectively handles datasets with a very large number of features.
Abstract: In this paper, we examine a method for feature subset selection based on Information Theory. Initially, a framework for defining the theoretically optimal, but computationally intractable, method for feature subset selection is presented. We show that our goal should be to eliminate a feature if it gives us little or no additional information beyond that subsumed by the remaining features. In particular, this will be the case for both irrelevant and redundant features. We then give an efficient algorithm for feature selection which computes an approximation to the optimal feature selection criterion. The conditions under which the approximate algorithm is successful are examined. Empirical results are given on a number of data sets, showing that the algorithm effectively handles datasets with a very large number of features.
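The elimination criterion can be approximated with conditional mutual information: drop a feature when some kept feature subsumes (almost) all of its information about the class. The sketch below uses a conditioning set of size one and discrete variables; the threshold and data are illustrative, and the paper's actual approximation is more general.

```python
# Eliminate features that add little information beyond a kept feature
# (a size-1 conditioning sketch of the information-theoretic criterion).
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(f, y, g):
    # I(f; y | g) for discrete arrays: average MI within each stratum of g.
    total = 0.0
    for v in np.unique(g):
        m = g == v
        total += m.mean() * mutual_info_score(f[m], y[m])
    return total

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 500)
f0 = y ^ (rng.random(500) < 0.05)        # informative
f1 = f0.copy()                           # redundant copy of f0
f2 = rng.integers(0, 2, 500)             # irrelevant
X = np.column_stack([f0, f1, f2])

kept = list(range(X.shape[1]))
for j in list(kept):
    others = [k for k in kept if k != j]
    # Eliminate j if some other kept feature subsumes its information on y.
    if any(cond_mi(X[:, j], y, X[:, k]) < 0.01 for k in others):
        kept.remove(j)
print("kept features:", kept)
```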

1,713 citations

Proceedings Article
03 Jul 1996
TL;DR: The theoretic analysis and the experimental study show that the proposed probabilistic approach is simple to implement and guaranteed to find the optimal feature subset if resources permit.
Abstract: Feature selection can be defined as the problem of finding a minimum set of M relevant attributes that describes the dataset as well as the original N attributes do, where M ≤ N. After examining the problems with both the exhaustive and the heuristic approaches to feature selection, this paper proposes a probabilistic approach. The theoretic analysis and the experimental study show that the proposed approach is simple to implement and guaranteed to find the optimal subset if resources permit. It is also fast in obtaining results and effective in selecting features that improve the performance of a learning algorithm. An on-site application involving huge datasets has been conducted independently, proving the effectiveness and scalability of the proposed algorithm. Also discussed are various aspects and applications of this feature selection algorithm.
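The probabilistic search the abstract outlines can be sketched as a Las-Vegas-style filter: sample random feature subsets and keep the smallest one whose inconsistency rate on the data is no worse than that of the full feature set. The trial budget and data below are illustrative assumptions.

```python
# Random-subset search for a minimal consistent feature set (a sketch).
import numpy as np

def inconsistency(X, y):
    # For each distinct row pattern, count samples not in the majority class.
    bad = 0
    _, inv = np.unique(X, axis=0, return_inverse=True)
    inv = inv.ravel()
    for i in np.unique(inv):
        labels = y[inv == i]
        bad += labels.size - np.bincount(labels).max()
    return bad / len(y)

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 300)
X = np.column_stack([y, y ^ (rng.random(300) < 0.05),
                     rng.integers(0, 2, 300), rng.integers(0, 3, 300)])

allowed = inconsistency(X, y)            # consistency of the full set
best = list(range(X.shape[1]))
for _ in range(200):                     # trial budget (assumed)
    size = rng.integers(1, X.shape[1] + 1)
    subset = sorted(rng.choice(X.shape[1], size=size, replace=False))
    if len(subset) < len(best) and inconsistency(X[:, subset], y) <= allowed:
        best = subset
print("smallest consistent subset:", best)
```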

733 citations

Journal ArticleDOI
TL;DR: Support vector machines (SVM) are attractive for the classification of remotely sensed data, with some claims that the method is insensitive to the dimensionality of the data and therefore does not require a dimensionality-reduction analysis in preprocessing. It is shown, however, that the accuracy of a classification by an SVM does vary as a function of the number of features used.
Abstract: Support vector machines (SVM) are attractive for the classification of remotely sensed data with some claims that the method is insensitive to the dimensionality of the data and, therefore, does not require a dimensionality-reduction analysis in preprocessing. Here, a series of classification analyses with two hyperspectral sensor data sets reveals that the accuracy of a classification by an SVM does vary as a function of the number of features used. Critically, it is shown that the accuracy of a classification may decline significantly (at 0.05 level of statistical significance) with the addition of features, particularly if a small training sample is used. This highlights a dependence of the accuracy of classification by an SVM on the dimensionality of the data and, therefore, the potential value of undertaking a feature-selection analysis prior to classification. Additionally, it is demonstrated that, even when a large training sample is available, feature selection may still be useful. For example, the accuracy derived from the use of a small number of features may be noninferior (at 0.05 level of significance) to that derived from the use of a larger feature set providing potential advantages in relation to issues such as data storage and computational processing costs. Feature selection may, therefore, be a valuable analysis to include in preprocessing operations for classification by an SVM.
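The paper's central observation is easy to reproduce on synthetic data: with a small training sample, an SVM's test accuracy can fall as uninformative features are added, which is why a feature-selection step can still pay off. The sample and feature counts below are arbitrary choices, not the paper's hyperspectral setup.

```python
# SVM test accuracy versus dimensionality with a small training sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

for n_features in (5, 20, 80, 200):
    X, y = make_classification(n_samples=120, n_features=n_features,
                               n_informative=5, n_redundant=0, random_state=5)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=40,
                                              random_state=5)  # small sample
    acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{n_features:3d} features -> accuracy {acc:.3f}")
```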

708 citations

Book ChapterDOI
Usama M. Fayyad
17 Sep 1997
TL;DR: In this paper, the authors define the basic notions in data mining and KDD, define the goals, present motivation, and give a high-level definition of the KDD Process and how it relates to Data Mining.
Abstract: Data Mining and Knowledge Discovery in Databases (KDD) promise to play an important role in the way people interact with databases, especially decision support databases where analysis and exploration operations are essential. Inductive logic programming (ILP) can potentially play some key roles in KDD. This is an extended abstract for an invited talk at the conference. In the talk, we define the basic notions in data mining and KDD, define the goals, present motivation, and give a high-level definition of the KDD process and how it relates to data mining. We then focus on data mining methods, with basic coverage of a sampling of methods to illustrate how they are used. We cover a case study of a successful application in science data analysis: the classification and cataloging of a major astronomy sky survey covering 2 billion objects in the northern sky. The system can outperform humans as well as classical computational analysis tools in astronomy on the task of recognizing faint stars and galaxies. We also cover the problem of scaling a clustering problem to a large catalog database of billions of objects. We conclude with a listing of research challenges and outline areas where ILP could play some important roles in KDD.

609 citations