Comparing the Performance of Artificial Neural Networks, Decision Tree, Principal Component Regression and Multiple Linear Regression in Modeling Urban Air Quality Index

Document Type : Research Paper

Authors

Abstract

1. Introduction:
Increasing rates of urbanization and industrialization in the cities of developed and developing countries, such as Tehran, have led to increased air pollution. Today, the prediction and estimation of air quality parameters in urban regions are important topics in environmental studies because of their effect on human health. Air quality measures are widely used in air quality control plans. These measures classify air quality based on the amount of pollution and the various contaminants present. The first such measure was the Pollutant Standards Index (PSI), developed by the U.S. Environmental Protection Agency (US-EPA). This index converts the concentrations of the main air pollutants, namely carbon monoxide (CO), sulfur dioxide (SO2), particulate matter smaller than ten microns (PM10), ozone (O3), and nitrogen dioxide (NO2), into a standard air pollution index. In 1997, the US-EPA expanded the PSI and presented it as a new index named the Air Quality Index (AQI). One of the first steps in air pollution control is measuring the concentrations of air pollutants, including PM10, CO, O3, SO2, and NO2. The AQI relates pollutant concentrations to the level of public health and to the control measures associated with air pollution. It classifies air quality into six main groups: good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous. It also specifies the control measures for each class to prevent adverse effects of pollutants on people from different walks of life. Poor air quality caused by high pollutant concentrations in the large city of Tehran has caused various diseases and many problems for the public health and welfare of citizens, and it also damages the environment and living organisms.
Hence, assessing and modeling urban air quality, which is nonlinear in nature, and determining the factors that affect it are among the most essential environmental programs in large cities. Therefore, the present paper aims to compare the efficiency of artificial neural networks, decision tree, multiple linear regression and principal component regression in modeling and estimating the urban air quality index.
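
The AQI classes and the conversion from concentrations to index values described in this introduction can be sketched as follows. This is a minimal illustration, not the calculation used in the study: the class bounds follow the commonly published US-EPA AQI scale, the overall index is taken as the largest pollutant sub-index (the usual US-EPA convention), and the function names and all numeric inputs in the usage note are hypothetical.

```python
# Illustrative sketch only. Class upper bounds follow the commonly
# published US-EPA AQI scale; function names are hypothetical.
AQI_CLASSES = [
    (50, "good"),
    (100, "moderate"),
    (150, "unhealthy for sensitive groups"),
    (200, "unhealthy"),
    (300, "very unhealthy"),
    (500, "hazardous"),
]


def aqi_class(aqi):
    """Return the name of the AQI class that contains the given value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name in AQI_CLASSES:
        if aqi <= upper:
            return name
    # Assumption: values beyond the top of the scale stay in the last class.
    return "hazardous"


def sub_index(conc, c_lo, c_hi, i_lo, i_hi):
    """Linearly interpolate a pollutant sub-index inside one breakpoint band.

    c_lo, c_hi are the concentration breakpoints of the band and
    i_lo, i_hi the index values assigned to those breakpoints.
    """
    return (i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo


def overall_aqi(sub_indices):
    """Return (overall AQI, responsible pollutant) from a dict of sub-indices."""
    pollutant = max(sub_indices, key=sub_indices.get)
    return sub_indices[pollutant], pollutant
```

For example, `overall_aqi({"NO2": 120.0, "PM10": 80.0})` selects NO2 as the responsible pollutant, and `aqi_class(120.0)` places that hour in the "unhealthy for sensitive groups" class. The breakpoint table for each pollutant has to come from the guideline actually applied (US-EPA or CEHW); none is assumed here.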

2. Materials and methods:
In the present study, hourly data on air pollutant concentrations and meteorological parameters from the Tajrish and Gholhak stations in Tehran were used for modeling and estimation of the AQI. The meteorological and air pollution data used to develop the models were recorded at the Gholhak and Tajrish stations over the period 2005 to 2011. To assess the performance of the models and compare the results obtained in the training and testing phases, statistical indices such as the Index of Agreement (IA), Fractional Bias (FB), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Square Error (MSE), correlation coefficient (R) and coefficient of determination (R2) were used. The initial objective was to use the guidelines of the US-EPA and the Iranian Center Environmental Health and Work (CEHW) to calculate the air quality index from the hourly concentration of each pollutant. In this first step, the concentration of each pollutant is the input to the AQI calculation algorithm and the output is the air quality index for each pollutant; the overall air quality index was then used together with the meteorological data to develop the models. In the next step, air pollution and AQI values were obtained using time series of meteorological data. Models for simulating and estimating air pollution were then developed using artificial neural networks (ANN), decision tree, multiple linear regression (MLR) and principal component regression (PCR) methods in MATLAB. To develop the models, the data were randomly divided into a training set and a testing set: 80 percent of the data were used in the training phase and 20 percent in the testing phase. The final objective was the simulation and estimation of the air quality index for the studied stations in Tehran. Finally, the modeling methods used in this study were compared with each other to identify the model that produces the best estimation and modeling results.
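
The random 80/20 split and the evaluation indices named above can be sketched in a few lines. This is a hedged illustration in Python rather than the MATLAB code used in the study. Several choices are assumptions, since the abstract does not give formulas: IA is computed as Willmott's index of agreement, FB uses the sign convention in which a positive value means underprediction, and R2 is taken as the square of the Pearson correlation. The function names are hypothetical.

```python
import math
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Randomly split rows into a training part and a testing part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def evaluation_metrics(observed, predicted):
    """Compute MSE, RMSE, MAE, R, R2, IA and FB for paired series."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    pairs = list(zip(observed, predicted))
    sq_err = sum((p - o) ** 2 for o, p in pairs)
    mse = sq_err / n
    mae = sum(abs(p - o) for o, p in pairs) / n
    # Pearson correlation between observed and predicted values.
    cov = sum((o - o_mean) * (p - p_mean) for o, p in pairs)
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)
    # Assumption: Willmott's index of agreement.
    ia = 1.0 - sq_err / sum(
        (abs(p - o_mean) + abs(o - o_mean)) ** 2 for o, p in pairs
    )
    # Assumption: positive fractional bias means the model underpredicts.
    fb = 2.0 * (o_mean - p_mean) / (o_mean + p_mean)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R": r,
        "R2": r * r,  # assumption: squared Pearson correlation
        "IA": ia,
        "FB": fb,
    }
```

A typical use would be to call `train_test_split` once on the hourly records, fit each model on the training part, and call `evaluation_metrics` separately on the training and testing predictions so the two phases can be compared.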

3. Results:
The AQI calculations show that the dominant air quality class at Gholhak station is “unhealthy for sensitive groups” with 11165 hours, and that the main cause of poor air quality at this station is nitrogen dioxide. At Tajrish station, the class “moderate” is dominant with 17538 hours, and PM10 is chiefly responsible for this air quality. The modeling results showed that the applied methods differ in their performance in estimating the AQI. According to the findings, the CART algorithm performs well in estimating the air quality index, as the correlation between simulated and observed values is very close to 1. By trial and error, it was found that a perceptron artificial neural network with one hidden layer and the Levenberg-Marquardt training algorithm, with 20 neurons in the hidden layer for Gholhak station and 25 neurons in the hidden layer for Tajrish station, yields the best performance in estimating and modeling the air quality index. The highest correlation between the target variable and the estimated values was also determined. An initial investigation showed significant correlation among the input data used at the Gholhak and Tajrish stations. To resolve this problem, principal component analysis (PCA) was used. The KMO test was used to determine the feasibility of PCA. Since the KMO value was 0.581 at Gholhak station and 0.606 at Tajrish station, the feasibility of PCA was confirmed. To perform this method, the input variables were standardized, the correlation matrix was computed, and 13 eigenvalues and eigenvectors for Gholhak station and 12 eigenvalues and eigenvectors for Tajrish station were obtained. Components 1 to 5 at Gholhak station and components 1 to 4 at Tajrish station had eigenvalues greater than 1. These components were selected as the principal components and used as the inputs to the regression model.
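
The component selection step described above (standardize the inputs, form the correlation matrix, keep the components whose eigenvalue exceeds 1) can be sketched as follows. This is an illustration under stated assumptions, not the study's MATLAB procedure: the eigen-decomposition uses a plain Jacobi rotation method so that the sketch needs only the standard library, the KMO test and the final regression on the component scores are not included, and all function names are hypothetical.

```python
import math


def standardize(columns):
    """Scale each variable (a list of values) to zero mean and unit variance."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        result.append([(v - mean) / std for v in col])
    return result


def correlation_matrix(columns):
    """Correlation matrix of the variables, from standardized columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def jacobi_eigen(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues (descending) and eigenvectors of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
                for k in range(n):  # rotate rows p and q of A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    values = [a[i][i] for i in order]
    vectors = [[v[k][i] for k in range(n)] for i in order]
    return values, vectors


def kaiser_components(columns):
    """Return all eigenvalues and the eigenvectors with eigenvalue > 1."""
    values, vectors = jacobi_eigen(correlation_matrix(columns))
    kept = [vec for lam, vec in zip(values, vectors) if lam > 1.0]
    return values, kept
```

Here `columns` holds one list of hourly values per input variable. The scores that feed the regression would then be the standardized data projected onto the retained eigenvectors; how the study scaled those scores is not stated in the abstract.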
Equations 1 and 2 show the regression models for estimating the AQI at the Gholhak and Tajrish stations, respectively:

AQI = -63.74 + (9.89 × PC1) + (0.2 × PC2) + (0.19 × PC3) - (0.094 × PC4) - (1.09 × PC5) (1)
AQI = 28.23 + (0.933 × PC1) + (0.2415 × PC2) + (0.0336 × PC3) - (0.0088 × PC4) (2)
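
As a small worked illustration, the two regression equations can be evaluated for a given set of principal component scores. The sketch assumes the coefficients are decimal values (for example, an intercept of -63.74 for Gholhak and 28.23 for Tajrish); the function name and the scores passed to it are hypothetical, and the scaling of the PC scores is not specified in the abstract.

```python
# Coefficients taken from Equations 1 and 2, read as decimal numbers:
# intercept first, then the weights of PC1, PC2, ...
GHOLHAK = [-63.74, 9.89, 0.2, 0.19, -0.094, -1.09]
TAJRISH = [28.23, 0.933, 0.2415, 0.0336, -0.0088]


def estimate_aqi(coefficients, pc_scores):
    """Evaluate AQI = intercept + sum(weight_i * PC_i)."""
    if len(pc_scores) != len(coefficients) - 1:
        raise ValueError("one score is needed for each principal component")
    intercept, *weights = coefficients
    return intercept + sum(w * s for w, s in zip(weights, pc_scores))
```

With all scores set to zero, each model returns its intercept, which is a quick check that the coefficients were entered in the right order.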

4. Discussion and conclusion:
Error statistics for the two stations showed that the decision tree model performs better at Gholhak station than at Tajrish station. The correlation coefficient (R) and coefficient of determination (R2) of both models were very close to 1, which indicates the high ability of the regression decision tree model to estimate urban air quality. Comparison of the error statistics showed that the ANN model performs better at Tajrish station than at Gholhak station, and that the PCR model likewise performs better at Tajrish station than at Gholhak station. Across all methods used for modeling and estimating the air quality index at the studied stations, the ANN model with the Levenberg-Marquardt training algorithm had the best performance at both stations, and the PCR model had the worst. In this study, air quality was monitored at two stations. The findings suggest that the models employed here are suitable for appraising air quality at the studied stations, and that researchers can use them as a tool for gaining knowledge about air quality, for taking measures to control, reduce, and prevent pollution, and for informing the public more accurately about the air quality level in polluted urban areas.

Keywords

Main Subjects