Determining Vulnerable Areas of the Malekan Plain Aquifer for Nitrate Using the Random Forest Method

Document Type : Research Paper

Authors

1 MSc student of Hydrogeology, University of Tabriz

2 Professors of Hydrogeology, University of Tabriz

3 Assistant Professors of Hydrogeology, University of Tabriz

Abstract


Introduction:
Groundwater management is essential, especially in dry regions such as Iran, and the concern grows with the development of agriculture and industry, population growth and climate change, all of which affect the quality and quantity of groundwater resources. Groundwater contamination can, in turn, threaten human health. Because groundwater moves slowly through the subsurface, the impact of anthropogenic activities may last for a relatively long time; environmental measures should therefore focus mainly on preventing contamination. One way to prevent groundwater contamination is to identify the vulnerable regions of aquifers and to manage land use accordingly. Preparing groundwater vulnerability maps requires diverse methods and techniques, based on hydrogeological knowledge of the study region and on predictive models. Deciding which areas are vulnerable can involve collecting a large volume of data, which cannot be analyzed effectively without an adequate and efficient model. Several methods have been devised for vulnerability mapping that use relatively few data and rely on evidence of contamination. In this study, the random forest (RF) algorithm is proposed to overcome the problems of other methods.
Materials and methods:
The Malekan plain is located in East Azarbaijan Province, southeast of Urmia Lake in northwestern Iran, and covers 450 km². The region is a very active agricultural area whose water demand is supplied by groundwater resources. In recent years the groundwater quality of the area has degraded. The Malekan region contains several geological formations, such as the Lalon, Shemshak and Lar formations, and a large part of the western area consists of Quaternary alluvial deposits. The aquifer of the plain is unconfined and is formed mainly by old and recent alluvial terraces, alluvial fans and fluvial sediments. Based on well drilling logs and geophysical data, the western part of the plain consists of fine-grained material with low permeability.
Because of farming, the presence of grape farms in the region and the intensive use of fertilizers and manure, the nitrate concentration in the groundwater of the aquifer is high (Figure 1). To evaluate the quality of groundwater resources, and especially to assess nitrate anomalies in the groundwater of the Malekan plain, 27 samples were collected from groundwater resources in September 2014, and hydrochemical analyses were carried out in the Hydrology Laboratory of the University of Tabriz. In this study the random forest (RF) algorithm, a learning method based on an ensemble of decision trees, is proposed. The RF technique has advantages over other methods: high prediction accuracy, the ability to learn nonlinear relationships, and the ability to determine which variables are important in the prediction. The RF method is used here to estimate the vulnerability of the Malekan aquifer with four sets of data: model A with all variables, model B with variables related to the characteristics of the aquifer, model C with driving-force variables, and model D with variables related to the DRASTIC method. The predictions derived from all possible parameter combinations were evaluated using the root mean square error (RMSE) and the mean square error (MSE). The area under the curve (AUC) statistic was used to determine which models and which combinations of datasets performed better; an AUC value of 1 indicates a perfect model.
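The three evaluation statistics named above can be illustrated with a minimal sketch. It assumes binary labels (1 = polluted, 0 = not polluted) and predicted probabilities; the sample values and function names are hypothetical and are not taken from the study. The AUC is computed here with the pairwise (rank-based) definition, which may differ from the procedure the authors used.

```python
import math


def mse(y_true, y_pred):
    """Mean square error between observed labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]

print("MSE :", mse(y_true, y_prob))
print("RMSE:", rmse(y_true, y_prob))
print("AUC :", auc(y_true, y_prob))
```

A model that ranks every polluted sample above every unpolluted one reaches an AUC of 1, which corresponds to the perfect case mentioned in the text.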

Fig. 1. Spatial distribution of nitrate concentration
Results and Discussions:
Of the 23 explanatory variables used in the model, five (depth to water table, hydraulic conductivity, distance to grape farms, hydraulic gradient and transmissivity) describe the behavior of nitrate in the Malekan plain aquifer most accurately, since they yielded a smaller MSE. To obtain continuous and standardized variables over the whole study area, all data were transformed into raster format using three main approaches: 1) geostatistical techniques (e.g. hydraulic conductivity, hydraulic gradient and soil texture), 2) Euclidean distance raster calculations (potential point sources of contamination) and 3) classification of land cover from remotely sensed data and NDVI. The RF method was applied to the four sets of data (models A to D) described above. To set the value of k (the number of trees) at which the error converges and the estimation becomes more reliable, models made up of 1000 trees were generated from all explanatory variables. The number of split variables was then optimized by varying it between 1 and the maximum number of variables in each subset. The resulting models were evaluated using the out-of-bag (OOB) error estimate, and the model with the lowest OOB error was selected as the most accurate. Moreover, to reduce dimensionality and improve the accuracy and interpretability of the models, a feature selection (FS) strategy was adopted. The most significant predictive features were selected using the RF importance measures: the least significant explanatory variables of each subset were removed until the minimum error rate was reached.
Nitrate concentration was rescaled to a new binary response variable for every sample: samples with nitrate concentrations greater than or equal to the threshold value were assigned 1, and samples below the threshold were assigned 0. The explanatory variables (predictors) and the response variable were combined into a set of input-feature vectors, which formed the input to the RF algorithm. The binary response variable (nitrate pollution) served as the target for training the algorithm. Four models were used in this study to predict nitrate contamination of groundwater. As shown in Fig. 2, models A and B, with RMSE values of 0.11157 and 0.12214 respectively, placed approximately 44 and 42 percent of the region in the high-vulnerability class, located in the central and eastern parts of the aquifer. Models C and D, with RMSE values of 0.1392 and 0.1597 respectively, placed approximately 15 and 24 percent of the region in the high-vulnerability class and could not be trusted for assessing groundwater vulnerability.
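The Euclidean distance rasters mentioned above for potential point sources of contamination can be sketched as follows. This is a brute-force illustration under stated assumptions, not the GIS procedure used in the study: the grid size, the source cell positions and the 100 m cell size are hypothetical, and the function name is introduced here only for illustration.

```python
import math


def euclidean_distance_raster(n_rows, n_cols, sources, cell_size=1.0):
    """Return a grid holding, for every cell, the straight-line distance
    to the nearest source cell.

    sources   : list of (row, col) cell indices of potential point sources
    cell_size : edge length of one cell in map units (assumed square cells)
    """
    if not sources:
        raise ValueError("at least one source cell is required")
    raster = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Distance in cell units to the closest source, then scaled.
            nearest = min(math.hypot(r - sr, c - sc) for sr, sc in sources)
            row.append(nearest * cell_size)
        raster.append(row)
    return raster


# Hypothetical 3 x 4 grid with two source cells and 100 m cells.
grid = euclidean_distance_raster(3, 4, sources=[(0, 0), (2, 3)], cell_size=100.0)
for row in grid:
    print([round(value, 1) for value in row])
```

Each cell of the resulting grid can then serve as one explanatory variable value (for example, distance to grape farms) at that location. For large rasters a dedicated GIS tool would be preferable to this nested loop.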
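The rescaling of nitrate concentration and the assembly of input-feature vectors described above can be sketched as follows. The threshold of 50 mg/L is an assumption made only for this illustration (the text does not state the value used), and the sample concentrations, predictor values and function names are hypothetical.

```python
def binarize_nitrate(concentrations, threshold):
    """Map each concentration to 1 (at or above threshold) or 0 (below)."""
    return [1 if value >= threshold else 0 for value in concentrations]


def build_feature_vectors(predictors, names):
    """Combine per-variable columns into one feature vector per sample.

    predictors : dict mapping a variable name to its list of sample values
    names      : order in which the variables appear in each vector
    """
    n_samples = len(predictors[names[0]])
    for name in names:
        if len(predictors[name]) != n_samples:
            raise ValueError("all predictors need one value per sample")
    return [[predictors[name][i] for name in names] for i in range(n_samples)]


# Hypothetical nitrate concentrations (mg/L) for four samples.
nitrate = [12.0, 50.0, 73.5, 49.9]
ASSUMED_THRESHOLD = 50.0  # assumption for illustration, not from the study

# Hypothetical values for two of the predictors named in the text.
predictors = {
    "depth_to_water_table": [18.0, 6.5, 4.0, 9.2],
    "hydraulic_conductivity": [2.1, 7.8, 9.4, 5.0],
}
names = ["depth_to_water_table", "hydraulic_conductivity"]

y = binarize_nitrate(nitrate, ASSUMED_THRESHOLD)
X = build_feature_vectors(predictors, names)

for features, label in zip(X, y):
    print(features, "->", label)
```

The pairs of feature vector and binary label produced this way are the kind of input a random forest classifier is trained on.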


Fig. 2. Vulnerability maps of the four models: A) all variables, B) variables related to characteristics of the aquifer, C) driving-force variables, and D) variables related to the DRASTIC method
Keywords: Groundwater, Malekan plain, Nitrate, Vulnerability, Random Forest

