Journal Article

Predictive Modeling Approach for Surface Water Quality: Development and Comparison of Machine Learning Models

06 Jul 2021-Sustainability (Sustainability)-Vol. 13, Iss: 14, pp 7515
TL;DR: In this article, the authors investigated the predictive performance of gene expression programming (GEP), artificial neural network (ANN) and linear regression model (LRM) for modeling monthly total dissolved solids (TDS) and specific conductivity (EC) in the upper Indus River at two outlet stations.
Abstract: Water pollution is a growing global problem that threatens human health, ecosystem functions and agricultural production. The distinguishing features of artificial intelligence (AI) based modeling can offer deep insight into rising water quality concerns. The current study investigates the predictive performance of gene expression programming (GEP), artificial neural network (ANN) and linear regression model (LRM) for modeling monthly total dissolved solids (TDS) and specific conductivity (EC) in the upper Indus River at two outlet stations. In total, 30 years of historical water quality data, comprising 360 monthly TDS and EC records, were used for model training and testing. Based on significant correlation, seven input parameters were selected for modeling TDS and EC. Results were evaluated using various performance indicators, error assessment and external criteria. The simulated outcomes of the models showed a strong association with the actual data, with a correlation coefficient above 0.9 for both TDS and EC. Both the GEP and ANN models proved reliable in predicting TDS and EC. The mathematical equations formulated by GEP set it apart from ANN and LRM. The sensitivity analysis ranked the input variables affecting TDS as HCO3− (22.33%) > Cl− (21.66%) > Mg2+ (16.98%) > Na+ (14.55%) > Ca2+ (12.92%) > SO42− (11.55%) > pH (0%), while for EC the ranking was HCO3− (42.36%) > SO42− (25.63%) > Ca2+ (13.59%) > Cl− (12.8%) > Na+ (5.01%) > pH (0.61%) > Mg2+ (0%). The parametric analysis revealed that the models incorporated the effect of all the input parameters in the modeling process. The external assessment criteria confirmed the generalized outcome and robustness of the proposed approaches.
Conclusively, the outcomes of this study demonstrate that formulating AI-based models is cost-effective and helpful for river water quality assessment, management and policy making.
Citations
Journal Article
TL;DR: This paper compares individual supervised ML models, such as gene expression programming (GEP) and artificial neural network (ANN), with an ensemble learning model, i.e., random forest (RF), for predicting river water salinity in terms of electrical conductivity (EC) and dissolved solids (TDS) in the Upper Indus River basin, Pakistan.
Abstract: The prediction accuracy of machine learning (ML) models may depend not only on the input parameters and training dataset, but also on whether an ensemble or an individual learning model is selected. The present study compares individual supervised ML models, such as gene expression programming (GEP) and artificial neural network (ANN), with an ensemble learning model, i.e., random forest (RF), for predicting river water salinity in terms of electrical conductivity (EC) and dissolved solids (TDS) in the Upper Indus River basin, Pakistan. The proposed models were trained and tested using a dataset of seven input parameters chosen on the basis of significant correlation. The ensemble RF model was optimized by producing 20 sub-models and choosing the most accurate one. The goodness-of-fit of the models was assessed through well-known statistical indicators, such as the coefficient of determination (R2), mean absolute error (MAE), root mean squared error (RMSE), and Nash–Sutcliffe efficiency (NSE). The results demonstrated a strong association between inputs and modeling outputs, with R2 values of 0.96, 0.98, and 0.92 for the GEP, RF, and ANN models, respectively. The comparative performance of the proposed methods showed the relative superiority of RF over GEP and ANN. Among the 20 RF sub-models, the most accurate models yielded R2 values of 0.941 and 0.938 with 70 and 160 estimators, respectively. The lowest RMSE values of 1.37 and 3.1 were yielded by the ensemble RF model on training and testing data, respectively. The sensitivity analysis demonstrated that HCO3− is the most influential variable, followed by Cl− and SO42−, for both EC and TDS. The assessment of the models on external criteria confirmed that the results of all the aforementioned techniques generalize.
Conclusively, the outcome of the present research indicated that the RF model with selected key parameters could be prioritized for water quality assessment and management.
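The four goodness-of-fit indicators named in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the sample numbers are made up, and R2 is assumed here to be the squared Pearson correlation between observed and simulated values (the abstract does not say which definition was used).

```python
import math

def mae(obs, sim):
    """Mean absolute error between observed and simulated values."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    """Root mean squared error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - (squared error / variance of observations)."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / spread

def r_squared(obs, sim):
    """Squared Pearson correlation (an assumed definition of R2)."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov ** 2 / (var_obs * var_sim)

# Hypothetical values: every simulated value is 0.5 above the observation.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.5, 3.5, 4.5]
print(mae(observed, simulated), rmse(observed, simulated))
print(nse(observed, simulated), r_squared(observed, simulated))
```

With this definition a constant offset leaves R2 at 1 while NSE drops, which is one reason to report several indicators together.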

16 citations

01 Jan 2016
TL;DR: Advances in data science and data mining methods such as neural networks (NNs), fuzzy inference methods, support vector machines (SVMs), and k-nearest neighbors (k-NN) have made it possible to solve complex problems in high dimensions.
Abstract: (© 2016 American Water Works Association, Journal AWWA, April 2016, 108:4.) Assessment of surface water quality is important in the management of water resources (Dogan et al. 2009). Water quality in rivers is paramount to the well-being of nature and humans, and surface water quality is usually related to the type of surrounding industries, agriculture, and human activities. Water is withdrawn from the hydrologic cycle to meet various needs and then is returned (Banejad & Olyaie 2011). Given the essential role of rivers to agricultural, industrial, and urban needs, it is necessary to regularly monitor and evaluate water quality in rivers. As rivers pass through different regions, changes in water quality and the level of hydrochemical parameters are observed in these regions. Because of the gradual decline in water quality over time, regulatory bodies in various countries have made decisions to mitigate the damage. Ecologically acceptable water management calls for accurate modeling, forecasting, and analyzing water quality in rivers (Durdu 2010). Numerous models have been developed for management of water quality, such as QUAL2E, Water Quality Analysis Simulation, and the US Army Corps of Engineers’ Hydrologic Engineering Center-5Q (Chen et al. 2003). Using these models is time-consuming and expensive; therefore, development of cost-effective models is encouraged. Because of the propensity of varied standards for water quality, different parameters are used as quality indicators. The quantity of ammonia, cadmium, chemical oxygen demand, chlorine, copper, dissolved phosphorus, lead, nitrogen dioxide, suspended solids, total nitrogen, total phosphorus, zinc, sodium, sodium adsorption ratio, sulfate ions, bicarbonate ions, electrical conductivity (EC), total dissolved solids (TDS), and pH is frequently measured at water quality monitoring stations.
EC and TDS levels in water are two of the main parameters used to determine quality of drinking and agricultural water because they directly represent the total concentration of salt in water. High EC and TDS values are not desirable in water used for irrigation because salt affects plant growth through osmosis (Phocaides 2000). Advances in data science and data mining methods such as neural networks (NNs), fuzzy inference methods, support vector machines (SVMs), and k-nearest neighbors (k-NN), have made it possible to solve complex problems in high dimensions. The general principle behind these methods lies in exploring hidden relationships in large volumes of data and building models that reflect physical processes governing the system under study. A data-derived model represents a relationship between input variables and output variables. Such a model can be highly accurate because it captures relationships of any kind that are expressed in data, including the underlying physics and chemistry.

14 citations

Journal Article
TL;DR: This article proposes a modification of the Ukrainian method for assessing the WQI that takes into account the level of negative impact of the most dangerous chemical elements, models WQI assessment using fuzzy logic, and creates an artificial neural network model for predicting the WQI.
Abstract: Various human activities have been the main causes of surface water pollution. The uneven distribution of industrial enterprises in the territories of the main river basins of Ukraine does not always allow the real state of the water quality to be assessed. This article has three purposes: (1) the modification of the Ukrainian method for assessing the WQI, taking into account the level of negative impact of the most dangerous chemical elements, (2) the modeling of WQI assessment using fuzzy logic and (3) the creation of an artificial neural network model for the prediction of the WQI. The fuzzy logic model used four input variables and calculated one output variable (WQI). In the final stage of the study, six ANN models were analyzed, which differed from each other in their loss function optimizers and activation functions. The best results were obtained using an ANN with the softmax activation function and the Adam loss function optimizer (MAPE = 9.6%; R2 = 0.964). A comparison of the MAPE and R2 indicators of the created ANN model with other models for assessing water quality showed that the level of agreement between the forecast and target data is satisfactory. The novelty of this study is in the proposal to modify the WQI assessment methodology which is used in Ukraine. At the same time, the phased and joint use of mathematical tools such as the fuzzy logic method and the ANN allows one to effectively evaluate and predict WQI values, respectively.

11 citations

Journal Article
TL;DR: In this paper, an improved form of supervised machine learning, i.e., multigene expression programming (MEP), has been used to propose models for the compressive strength (fc'), splitting tensile strength (fSTS), and flexural strength of sustainable bagasse ash concrete (BAC).
Abstract: The application of multiphysics models and soft computing techniques is gaining enormous attention in the construction sector due to the development of various types of concrete. In this research, an improved form of supervised machine learning, i.e., multigene expression programming (MEP), has been used to propose models for the compressive strength (fc'), splitting tensile strength (fSTS), and flexural strength (fFS) of sustainable bagasse ash concrete (BAC). The training and testing of the proposed models have been accomplished by developing a reliable and comprehensive database from published literature. Concrete specimens with varying proportions of sugarcane bagasse ash (BA), as a partial replacement of cement, were prepared, and the developed models were validated by utilizing the results obtained from the tested BAC. Different statistical tests evaluated the accurateness of the models, and the results were cross-validated employing a k-fold algorithm. The modeling results achieve correlation coefficient (R) and Nash-Sutcliffe efficiency (NSE) above 0.8 each with relative root mean squared error (RRMSE) and objective function (OF) less than 10 and 0.2, respectively. The MEP model leads in providing reliable mathematical expression for the estimation of fc', fSTS and fFS of BA concrete, which can reduce the experimental workload in assessing the strength properties. The study's findings indicated that MEP-based modeling integrated with experimental testing of BA concrete and further cross-validation is effective in predicting the strength parameters of BA concrete.

9 citations

References
Journal Article
01 Jan 1988-Nature
TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
Abstract: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
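The weight-adjustment loop described in this abstract can be illustrated with a very small network. The sketch below is an assumption-laden toy, not the paper's setup: it uses one hidden layer of two sigmoid units, a squared-error measure, plain gradient descent, and the XOR task; all names (`forward`, `gradients`, `train`, `init_params`) and the learning rate are chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the hidden activations and the network output for one input."""
    w_hid, b_hid, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hid, b_hid)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def loss(params, data):
    """Mean of 0.5 * (output - target)^2 over the data set."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

def gradients(params, data):
    """Back-propagate the output error to get the gradient of every parameter."""
    w_hid, b_hid, w_out, b_out = params
    n_hid, n_in = len(w_hid), len(w_hid[0])
    g_w_hid = [[0.0] * n_in for _ in range(n_hid)]
    g_b_hid = [0.0] * n_hid
    g_w_out = [0.0] * n_hid
    g_b_out = 0.0
    for x, t in data:
        hidden, out = forward(params, x)
        # Error signal at the output unit (chain rule through the sigmoid).
        delta_out = (out - t) * out * (1.0 - out)
        g_b_out += delta_out
        for j in range(n_hid):
            g_w_out[j] += delta_out * hidden[j]
            # Error signal passed back to hidden unit j.
            delta_hid = delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
            g_b_hid[j] += delta_hid
            for i in range(n_in):
                g_w_hid[j][i] += delta_hid * x[i]
    n = len(data)
    return ([[g / n for g in row] for row in g_w_hid],
            [g / n for g in g_b_hid],
            [g / n for g in g_w_out],
            g_b_out / n)

def train(params, data, lr=0.5, steps=500):
    """Gradient descent: repeatedly move each weight against its gradient."""
    w_hid, b_hid, w_out, b_out = params
    for _ in range(steps):
        g_wh, g_bh, g_wo, g_bo = gradients((w_hid, b_hid, w_out, b_out), data)
        w_hid = [[w - lr * g for w, g in zip(row, g_row)]
                 for row, g_row in zip(w_hid, g_wh)]
        b_hid = [b - lr * g for b, g in zip(b_hid, g_bh)]
        w_out = [w - lr * g for w, g in zip(w_out, g_wo)]
        b_out = b_out - lr * g_bo
    return w_hid, b_hid, w_out, b_out

def init_params(n_in=2, n_hid=2, seed=0):
    """Small random starting weights (fixed seed so runs are repeatable)."""
    rng = random.Random(seed)
    w_hid = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
    b_hid = [rng.uniform(-1, 1) for _ in range(n_hid)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hid)]
    return w_hid, b_hid, w_out, rng.uniform(-1, 1)

XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
start = init_params()
trained = train(start, XOR)
print(loss(start, XOR), loss(trained, XOR))
```

The hidden-unit error signal `delta_hid` is the step that lets the hidden layer learn its own features. Whether this toy network actually solves XOR depends on the starting weights and the number of steps, so the sketch only prints the loss before and after training.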

23,814 citations

Journal Article
TL;DR: In this article, the principles governing the application of the conceptual model technique to river flow forecasting are discussed and the necessity for a systematic approach to the development and testing of the model is explained and some preliminary ideas suggested.

19,601 citations

Journal Article
TL;DR: In this article, it is shown that many particular choices among possible neurophysiological assumptions are equivalent, in the sense that for every net behaving under one assumption, there exists another net which behaves under another and gives the same results, although perhaps not in the same time.

14,937 citations

Book
John R. Koza
01 Jan 1992
TL;DR: This book discusses the evolution of architecture, primitive functions, terminals, sufficiency, and closure, and the role of representation and the lens effect in genetic programming.
Abstract: Background on genetic algorithms, LISP, and genetic programming; hierarchical problem-solving; introduction to automatically-defined functions - the two-boxes problem; problems that straddle the breakeven point for computational effort; Boolean parity functions; determining the architecture of the program; the lawnmower problem; the bumblebee problem; the increasing benefits of ADFs as problems are scaled up; finding an impulse response function; artificial ant on the San Mateo trail; obstacle-avoiding robot; the minesweeper problem; automatic discovery of detectors for letter recognition; flushes and four-of-a-kinds in a pinochle deck; introduction to biochemistry and molecular biology; prediction of transmembrane domains in proteins; prediction of omega loops in proteins; lookahead version of the transmembrane problem; evolutionary selection of the architecture of the program; evolution of primitives and sufficiency; evolutionary selection of terminals; evolution of closure; simultaneous evolution of architecture, primitive functions, terminals, sufficiency, and closure; the role of representation and the lens effect. Appendices: list of special symbols; list of special functions; list of type fonts; default parameters; computer implementation; annotated bibliography of genetic programming; electronic mailing list and public repository.

13,487 citations

Journal Article
TL;DR: In this paper, the authors present guidelines for watershed model evaluation based on the review results and project-specific considerations, including single-event simulation, quality and quantity of measured data, model calibration procedure, evaluation time step, and project scope and magnitude.
Abstract: Watershed models are powerful tools for simulating the effect of watershed processes and management on soil and water resources. However, no comprehensive guidance is available to facilitate model evaluation in terms of the accuracy of simulated data compared to measured flow and constituent values. Thus, the objectives of this research were to: (1) determine recommended model evaluation techniques (statistical and graphical), (2) review reported ranges of values and corresponding performance ratings for the recommended statistics, and (3) establish guidelines for model evaluation based on the review results and project-specific considerations; all of these objectives focus on simulation of streamflow and transport of sediment and nutrients. These objectives were achieved with a thorough review of relevant literature on model application and recommended model evaluation methods. Based on this analysis, we recommend that three quantitative statistics, Nash-Sutcliffe efficiency (NSE), percent bias (PBIAS), and ratio of the root mean square error to the standard deviation of measured data (RSR), in addition to the graphical techniques, be used in model evaluation. The following model evaluation performance ratings were established for each recommended statistic. In general, model simulation can be judged as satisfactory if NSE > 0.50 and RSR < 0.70, and if PBIAS ± 25% for streamflow, PBIAS ± 55% for sediment, and PBIAS ± 70% for N and P. For PBIAS, constituent-specific performance ratings were determined based on uncertainty of measured data. Additional considerations related to model evaluation guidelines are also discussed. These considerations include: single-event simulation, quality and quantity of measured data, model calibration procedure, evaluation time step, and project scope and magnitude. A case study illustrating the application of the model evaluation guidelines is also provided.
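The "satisfactory" rule quoted in this abstract can be written as a short check. The sketch below is illustrative only: the function names, the `PBIAS_LIMITS` table and the sample data are assumptions, PBIAS is taken as 100 × Σ(observed − simulated) / Σ(observed), and the PBIAS limits are read as a strict bound on the absolute value (the abstract does not say whether the bounds are inclusive).

```python
import math

# PBIAS limits (percent) per constituent, as listed in the abstract.
PBIAS_LIMITS = {"streamflow": 25.0, "sediment": 55.0, "nutrients": 70.0}

def rsr(obs, sim):
    """RMSE divided by the standard deviation of the observations."""
    mean_obs = sum(obs) / len(obs)
    error = math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)))
    spread = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    return error / spread

def nse(obs, sim):
    """Nash-Sutcliffe efficiency, which equals 1 - RSR^2."""
    return 1.0 - rsr(obs, sim) ** 2

def pbias(obs, sim):
    """Percent bias; positive values mean the model underestimates on average."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

def is_satisfactory(obs, sim, constituent="streamflow"):
    """Apply the NSE, RSR and PBIAS thresholds together."""
    limit = PBIAS_LIMITS[constituent]
    return (nse(obs, sim) > 0.50
            and rsr(obs, sim) < 0.70
            and abs(pbias(obs, sim)) < limit)

# Hypothetical measured and simulated flows.
measured = [10.0, 20.0, 30.0, 40.0]
simulated = [12.0, 18.0, 33.0, 37.0]
print(nse(measured, simulated), rsr(measured, simulated), pbias(measured, simulated))
print(is_satisfactory(measured, simulated))
```

Because NSE and RSR are tied together by NSE = 1 − RSR², PBIAS is the statistic that adds independent information about systematic over- or underestimation.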

9,386 citations