Construction and evaluation of hourly average indoor PM2.5 concentration prediction models based on multiple types of places

Shi, Yewen; Du, Zhiyuan; Zhang, Jianghua; Han, Fengchan; Chen, Feier; Wang, Duo; Liu, Mengshuang; Zhang, Hao; Dong, Chunyang; Sui, Shaofeng

doi:10.3389/fpubh.2023.1213453

ORIGINAL RESEARCH article

Front. Public Health, 10 August 2023

Sec. Environmental Health and Exposome

Volume 11 - 2023 | https://doi.org/10.3389/fpubh.2023.1213453

This article is part of the Research Topic: Indoor Environment and Respiratory Diseases

Yewen Shi¹^†

Zhiyuan Du²^†

Jianghua Zhang¹

Fengchan Han¹

Feier Chen¹

Duo Wang¹

Mengshuang Liu¹

Hao Zhang²

Chunyang Dong¹^*

Shaofeng Sui¹^*

¹Shanghai Municipal Center for Disease Control and Prevention, Shanghai, China
²Department of Environmental Health, Key Laboratory of the Public Health Safety, Ministry of Education, School of Public Health, Fudan University, Shanghai, China

Background: People usually spend most of their time indoors, so indoor fine particulate matter (PM_2.5) concentrations are crucial for refining individual PM_2.5 exposure evaluation. The development of indoor PM_2.5 concentration prediction models is essential for the health risk assessment of PM_2.5 in epidemiological studies involving large populations.

Methods: In this study, based on the monitoring data of multiple types of places, the classical multiple linear regression (MLR) method and random forest regression (RFR) algorithm of machine learning were used to develop hourly average indoor PM_2.5 concentration prediction models. Indoor PM_2.5 concentration data, which included 11,712 records from five types of places, were obtained by on-site monitoring. Moreover, the potential predictor variable data were derived from outdoor monitoring stations and meteorological databases. A ten-fold cross-validation was conducted to examine the performance of all proposed models.

Results: The final predictor variables incorporated in the MLR model were outdoor PM_2.5 concentration, type of place, season, wind direction, surface wind speed, hour, precipitation, air pressure, and relative humidity. The ten-fold cross-validation results indicated that both models had good predictive performance, with coefficients of determination (R²) of 72.20% for RFR and 60.35% for MLR. Overall, the RFR model predicted better than the MLR model, including when it was developed using the same predictor variables as the MLR model (R² = 71.86%). For both types of models, the variable importance results suggested that outdoor PM_2.5 concentration, type of place, season, hour, wind direction, and surface wind speed were the most important predictor variables.

Conclusion: In this research, hourly average indoor PM_2.5 concentration prediction models based on multiple types of places were developed for the first time. Both the MLR and RFR models, which rely on easily accessible indicators, displayed promising predictive performance, and the machine learning RFR model outperformed the classical MLR model. This result suggests the potential of RFR algorithms for indoor air pollutant concentration prediction.

1. Introduction

PM_2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less, which is one of the environmental pollutants with the greatest impact on public health (1–3). Numerous epidemiological studies have shown that both long-term and short-term exposure to PM_2.5 increases the risk of death from respiratory and cardiovascular diseases in the population (4–6). Studies have shown that for every 10 μg/m³ increase in the average concentration of PM_2.5 in ambient air, there is a 3.1% increase in hospital admissions and a 2.5% increase in mortality from chronic obstructive pulmonary disease (7). Furthermore, there is a 3% increase in emergency department visits for bronchial asthma (8), a 16% increase in the risk of death from ischemic heart disease, and a 14% increase in mortality from stroke (4, 9).

Currently, most relevant studies use ambient PM_2.5 concentrations as a surrogate for human PM_2.5 exposure. They do not take into account the difference between indoor and outdoor PM_2.5 concentrations or the contribution of indoor PM_2.5 exposure to actual human exposure, which limits the interpretation of their results. Most people spend at least 80% of their day indoors, and for some specific populations, such as older adults and children, this percentage is even higher (10–12). Therefore, indoor PM_2.5 concentration is crucial for accurate PM_2.5 exposure assessment and health risk assessment. Direct measurement of indoor PM_2.5 concentration provides the most accurate data; however, it is not easy to achieve, as it requires substantial manpower and material resources as well as the compliance of the research participants, especially in large-scale population and/or long-term studies. When direct measurement is difficult to achieve, it is important to construct appropriate predictive models.

At present, many studies have been conducted to establish prediction models for indoor PM_2.5 concentration (12–18), mainly involving multiple linear regression (MLR) models and random forest regression (RFR) models, which have their own advantages and disadvantages. For indoor PM_2.5 concentration, there is still controversy about which model has a better predictive effect. In addition, the models in these studies have mostly predicted the average indoor PM_2.5 concentration on one or more days, and do not adequately account for the fluctuation of indoor PM_2.5 concentration during the day (or longer) and the variability of individual behaviors over time (19–21). Obviously, the establishment of indoor PM_2.5 concentration prediction models with higher temporal resolution is of more practical significance to improve individual PM_2.5 exposure assessment. The existing models were constructed using indoor PM_2.5 concentration monitoring data from a single type of place, which is not universal enough and inevitably limits the practical application to different types of places. No study has yet established prediction models for hourly average indoor PM_2.5 concentration based on data from multiple types of places.

In this study, monitoring data on indoor PM_2.5 concentrations from five types of typical sites (offices, primary and secondary schools, kindergartens, shopping malls, and restaurants) in Shanghai were collected during different seasons. The data were used to develop and evaluate predictive MLR and RFR models for hourly average indoor PM_2.5 concentrations based on multiple types of places. The aim of the study was to provide a feasible way to improve individual PM_2.5 exposure assessment.

2. Materials and methods

2.1. Data collection

Five types of typical locations (offices, middle and primary schools, kindergartens, shopping malls, and restaurants) were selected for indoor PM_2.5 concentration field monitoring in 16 districts of Shanghai. A TSI DustTrak 8530 benchtop aerosol monitor (TSI Incorporated, Shoreview, MN, United States) was used for the monitoring. In office buildings, shopping malls, and restaurants, one floor in each of the high, middle, and low sections of the building was selected as a monitoring site. Two, four, and six monitoring points were set for indoor areas of 200–1,000 m², 1,001–5,000 m², and over 5,000 m², respectively. In kindergartens and in middle and primary schools, two classrooms from each selected floor in the high, middle, and low sections were used as monitoring sites. One, three, and five monitoring points were set for indoor areas of less than 50 m², 50–100 m², and more than 100 m², respectively. All of the above points were distributed evenly along the diagonal of the room or in a plum-blossom pattern, and the height of each point was set at the level of the human breathing zone (0.8–1.2 m). Measurements were taken in January, April, July, and October of 2018 (the 4 months represented the four seasons of the year: January for winter, April for spring, July for summer, and October for autumn). Indoor PM_2.5 concentrations in each location were monitored for 1 week in each of these 4 months, with each instrument recording the concentration every 15 min. Monitoring covered all times of the day (00:00–23:00 h) to capture people's activities in the various places as fully as possible.

For the construction of prediction models, we used the findings of relevant publications (17, 21–24) to identify 11 easily accessible indicators that may have significant effects on indoor PM_2.5 concentrations. Details of the indicators are given in Supplementary Table S1. The outdoor PM_2.5 and PM₁₀ concentration data were obtained from the monitoring stations of 16 municipal control points in Shanghai. The distances between all government-controlled monitoring stations and each monitored indoor place were calculated, and the data from the closest station were used as the outdoor PM_2.5 and PM₁₀ concentration data for that place. Meteorological data for the same period were obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF) and included outdoor temperature, relative humidity, air pressure, precipitation, surface wind speed, and wind direction.
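The nearest-station matching described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not state which distance measure was used, so the great-circle (haversine) distance is an assumption, and the station names and coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(site, stations):
    """Return the name of the station closest to `site`, a (lat, lon) tuple."""
    return min(stations, key=lambda name: haversine_km(*site, *stations[name]))

# Hypothetical coordinates for illustration only (not the study's sites or stations).
stations = {"station_A": (31.20, 121.40), "station_B": (31.30, 121.60)}
site = (31.22, 121.45)
closest = nearest_station(site, stations)
```

Each indoor place would then inherit the hourly outdoor PM_2.5 and PM₁₀ series of the station returned for it.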

2.2. Data analysis

The data analysis in this study was based on hourly arithmetic means; that is, the indoor and outdoor PM_2.5 concentrations, outdoor PM₁₀ concentration, and related meteorological parameters were all processed as hourly mean values. For example, the indoor PM_2.5 concentration at 09:00 h was the mean value from 08:00 h to 09:00 h. After data cleaning, the final database consisted of 11,712 records with 11 potential predictor variables, and the natural log-transformed indoor PM_2.5 concentration (approximately normally distributed) served as the response variable for MLR and RFR model construction. Data analysis and model construction were performed with R software (version 4.1.0), and statistical significance levels were set at p < 0.01 and p < 0.05 (two-sided).
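The hourly averaging and log transformation can be illustrated with a short sketch. It assumes that a 15-min reading stamped within (08:00, 09:00] belongs to the 09:00 hour; the text gives the 08:00 to 09:00 window but not how its end points were handled, so that convention and the sample values below are assumptions.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

def hourly_log_means(readings):
    """Average 15-min readings into hourly means, then take the natural log.

    `readings` is a list of (datetime, concentration in ug/m3) pairs with
    positive concentrations. A reading stamped in (08:00, 09:00] is assigned
    to the 09:00 hour (assumed convention).
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        floor = ts.replace(minute=0, second=0, microsecond=0)
        label = floor if ts == floor else floor + timedelta(hours=1)
        buckets[label].append(value)
    return {label: math.log(sum(v) / len(v)) for label, v in sorted(buckets.items())}

# Made-up readings for illustration only.
readings = [
    (datetime(2018, 1, 8, 8, 15), 30.0),
    (datetime(2018, 1, 8, 8, 30), 34.0),
    (datetime(2018, 1, 8, 8, 45), 38.0),
    (datetime(2018, 1, 8, 9, 0), 38.0),
    (datetime(2018, 1, 8, 9, 15), 50.0),
]
hourly = hourly_log_means(readings)
```

Here the four readings from 08:15 to 09:00 average to 35.0 μg/m³, so the 09:00 entry is ln(35.0).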

2.3. MLR model construction steps

A sensitivity analysis was conducted for the effects of different variable screening methods on the predictive efficacy of MLR models. The three variable screening methods were as follows: 1) manually supervised forward linear regression, which is commonly used in classical land-use regression modeling (25, 26), 2) stepwise regression (backward; variables with regression coefficient p < 0.05 were retained), and 3) least absolute shrinkage and selection operator (Lasso). The manually supervised forward linear regression method was used to build a basal multiple regression model in three steps: (a) After the assumptions of the regression model were tested, each potential predictor variable was regressed univariately against the response variable (natural log-transformed hourly average PM_2.5 concentration), and predictor variables with significant (p < 0.05) regression coefficients were retained for the next step. (b) Correlations between predictor variables were tested. Among predictor variables that were highly correlated with one another (Spearman r > 0.50, p < 0.05), only the one with the highest coefficient of determination (R²) was retained for further analysis. (c) The predictor variables that remained after the previous two steps were sorted by R² (from highest to lowest) and then entered into the regression model one at a time in that order. Finally, only those predictor variables that had significant partial regression coefficients (p < 0.05), raised the R² of the model by more than 1%, and had coefficients consistent with the a priori hypothesis (such as a positive coefficient for outdoor PM_2.5) were retained.
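Whichever screening method is used, the resulting MLR model has the usual log-linear form. The notation below is ours and is introduced only for illustration: C_in,i is the hourly average indoor PM_2.5 concentration for record i, x_ij are the p retained predictor variables (categorical predictors such as type of place and season would enter as indicator variables), and ε_i is the error term.

```latex
\ln\left(C_{\mathrm{in},i}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i ,
\qquad
\widehat{C}_{\mathrm{in},i} = \exp\left(\hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\right)
```

The second expression is the plain back-transformation to the concentration scale; whether any correction was applied when back-transforming is not stated in the text.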

In the process of MLR model diagnosis, variance inflation factors of the predictive variables were tested to evaluate multicollinearity. Additionally, considering that season may modify the effects of other potential predictor variables on indoor PM_2.5 concentration, we stratified the data by winter–spring (January, April) and summer-autumn (July, October) seasons and developed season-specific prediction models.

2.4. RFR model construction steps

The random forest model is a machine learning model that classifies and/or predicts unknown samples through ensemble learning with a large number of decision trees. It is now widely used in the processing of big data because of its fast computing speed, high prediction accuracy, and robustness to noise (27–29). The model has two significant characteristics, namely sample randomization and variable randomization. The bagging (bootstrap aggregating) algorithm is the basis of the random forest model: samples are drawn randomly with replacement to form different data sets, each of which trains one base learner, so that the individual learners are largely independent of one another. The random forest algorithm extends bagging. In addition to random sampling of samples, it randomly selects a subset of variables at each split node of a tree, which further increases the diversity among decision trees, reduces the risk of model overfitting, and can effectively improve the generalization performance of the final ensemble model (27, 29). The prediction accuracy and generalization of a random forest model are closely related to two important hyperparameters, ntree (the number of trees) and mtry (the number of variables randomly sampled as candidates at each split). The randomForest package of R software (version 4.1.0) was used to construct the RFR model. In our analysis, different values were set for these two parameters as a sensitivity analysis in order to obtain maximum model prediction effectiveness. The percentage increase in mean squared error (%IncMSE) of the predicted value was taken as the indicator of variable importance; that is, the values of each predictor variable were randomly replaced in turn. If the predictor variable is important, the prediction error of the model will increase after its values are randomly replaced, so the larger the %IncMSE, the more important the variable.
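The idea behind a permutation-based importance measure can be shown with a simplified sketch. It is not the randomForest package's implementation, which works tree by tree on out-of-bag samples; here a single fixed prediction function is scored on one data set, and the function names, toy model, and data are all hypothetical.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of `predict` over paired rows and targets."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def percent_increase_mse(predict, rows, targets, column, seed=0):
    """Percent rise in MSE after shuffling the values of one predictor column."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    base = mse(predict, rows, targets)  # assumed non-zero (imperfect model)
    return 100.0 * (mse(predict, permuted, targets) - base) / base

# Toy data: the target depends on column 0 only; column 1 is irrelevant.
rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [2.0 * r[0] + (0.1 if i % 2 else -0.1) for i, r in enumerate(rows)]
toy_model = lambda r: 2.0 * r[0]  # hypothetical fitted model

importance_col0 = percent_increase_mse(toy_model, rows, targets, column=0)
importance_col1 = percent_increase_mse(toy_model, rows, targets, column=1)
```

Shuffling the column the model relies on inflates the error, whereas shuffling the irrelevant column leaves it unchanged, which is the behaviour the importance ranking exploits.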

In order to evaluate and compare the prediction efficiency of the MLR model and the RFR model for indoor hourly average PM_2.5 concentration in various types of places, we developed two RFR models. The first RFR model was called the Full variables-RFR model (Full-RFR). Since the RFR model does not need to consider preconditions such as the independence of predictive variables that are faced by general MLR models, all 11 potential predictive variables were included in the model. The second RFR model was called the Conjoint-RFR model (Conjoint-RFR). In order to compare the MLR and RFR models, this Conjoint-RFR model was established using the same predictor variables as the MLR model with the best prediction performance identified in the previous steps.

2.5. Evaluation of models

The R² and root mean squared error (RMSE) calculated based on the predicted and measured values of the model were used as the model performance evaluation indexes. In addition, the generalization performance of the model was evaluated by a ten-fold cross-validation (CV) method. In short, the entire dataset was randomly and equally divided into ten subsets, nine of which were selected as the training set and the remaining one was used as the test set to test the prediction performance of the model. This process was repeated 10 times until each subset was used for one verification (30).
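The ten-fold procedure and the two evaluation indexes can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: a one-predictor least-squares line stands in for the real models, the data are synthetic, and R² and RMSE are computed from the pooled out-of-fold predictions, which is one common convention that the text does not confirm.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cross_validate(xs, ys, k=10, seed=0):
    """Return (R2, RMSE) computed from pooled out-of-fold predictions."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal subsets
    preds = [0.0] * len(xs)
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = a + b * xs[i]  # predict only records the fit never saw
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))

# Synthetic data for illustration only.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]
cv_r2, cv_rmse = cross_validate(xs, ys)
```

Because every record is predicted by a model that was fitted without it, the resulting R² and RMSE reflect generalization rather than goodness of fit on the training data.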

3. Results

3.1. Indoor PM_2.5 pollution in various places

Summary statistics of hourly average indoor PM_2.5 concentrations for each type of place are shown in Table 1. In general, the median hourly average indoor PM_2.5 concentration was 34.9 μg/m³ and the interquartile range was 24.5 μg/m³, with a few readings on the high side and a maximum value of 288 μg/m³. Welch analysis of variance (Welch ANOVA) (31) showed significant differences (p < 0.01) in the hourly average indoor PM_2.5 concentrations among different types of places. The highest hourly average indoor PM_2.5 concentrations were found in restaurants (44.4 μg/m³), probably because frequent cooking in restaurants produces a large amount of cooking oil fumes and raises indoor PM_2.5 concentrations (32). The Ambient Air Quality Standards (GB 3095–2012) of China and the Environmental Protection Agency of the United States have set the daily average ambient PM_2.5 concentration limit at 35 μg/m³. No clearly established indoor PM_2.5 concentration standard exists in China; therefore, the daily average ambient PM_2.5 concentration standard and the classification method of the China Environmental Monitoring Station were used here to characterize the indoor PM_2.5 pollution in each location (Figure 1). With 35 μg/m³ as the standard, indoor PM_2.5 exceeded the standard to varying degrees in all types of places; restaurants were the worst, followed by kindergartens. The monitoring results suggest that the indoor environmental quality of these two types of places needs to be improved.

Table 1. Hourly average indoor PM_2.5 concentrations in each place (μg/m³).

Figure 1. Daily average indoor PM_2.5 concentrations (μg/m³) in each place. In reference to the classification method of the China Environmental Monitoring Station: 0–35 μg/m³ is excellent; 35–75 μg/m³ is good; 75–115 μg/m³ is light pollution; and 115–150 μg/m³ is moderate pollution. PM_2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less.

The changes of hourly average indoor PM_2.5 concentration at different times are shown in Figure 2. Overall, there were significant differences (p < 0.01) in indoor PM_2.5 at different times of the day, and we also observed significant intraday fluctuations in the monitoring data for each type of place (p < 0.05). The variability of PM_2.5 concentration at different times of the day in multiple types of places is closely related to the nature of the place. For example, the fluctuation of PM_2.5 concentration in the restaurant was as expected (p < 0.01), with two peaks occurring after 11:00 and after 17:00, which are roughly the beginning of lunch and dinner. At these times, intensive cooking leads to higher indoor PM_2.5 concentrations, and similar patterns were observed in other places (Figure 2). These results demonstrate the intraday variability of indoor PM_2.5 concentration as well as the spatial variability across places.

Figure 2. Variation of intraday hourly average indoor PM_2.5 concentration in each place (μg/m³). PM_2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less.

3.2. MLR model results

Univariate regression model results for hourly average indoor PM_2.5 concentration are summarized in Supplementary Table S2. All 11 predictor variables were significantly associated with hourly average indoor PM_2.5 (p < 0.05). The R² of the 11 predictor variables ranged from 0.1 to 30.54%, and nine variables exceeded 2%. The largest R² was for outdoor PM_2.5 concentration (30.54%), followed by outdoor PM₁₀ concentration (R² = 28.76%), season (R² = 24.05%), type of place (R² = 17.11%), and wind direction (R² = 8.64%). The final MLR model for log-transformed hourly average indoor PM_2.5 concentrations is shown in Table 2. The model developed with the stepwise regression method had the best prediction performance (CV R² = 60.48%) and the lowest prediction error (CV RMSE = 0.44) among the three MLR models (Table 3). The relative importance of the predictor variables within the MLR model was determined using the Lindeman, Merenda, and Gold (LMG) method, which has been evaluated as the most successful indicator of the relative importance of independent variables and was implemented with the "relaimpo" package of R software (33, 34) (Figure 3). Outdoor PM_2.5 concentration was the most important predictor variable, with an R² share of 33.91%, followed by type of place (27.62%), season (26.22%), wind direction (4.88%), and surface wind speed (2.80%). The two models developed after stratification by winter–spring and summer–autumn incorporated similar predictor variables, and their R² and RMSE after cross-validation were also remarkably close (winter–spring model: R² = 58.23%, RMSE = 0.38; summer–autumn model: R² = 58.79%, RMSE = 0.49; Supplementary Tables S3–S5).

Table 2. Multiple linear regression (MLR) model for log-transformed hourly average indoor PM_2.5.

Table 3. Summary of model performance evaluation results.

Figure 3. Relative importance of the multiple linear regression (MLR) model predictor variables. R², coefficient of determination; PM_2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less.

3.3. RFR model results

We compared all RFR models with ntree of 200, 500, and 1,000 and mtry of 1–11 (Supplementary Figure S1), and determined that ntree = 200 and mtry = 2 were the most suitable RFR parameters for this study after considering the model's prediction effectiveness, prediction error, and efficiency. The Conjoint-RFR model, which used the same predictor variables as the MLR model, explained a greater proportion of the variance of hourly average indoor PM_2.5 concentrations, with an R² (RMSE) of 89.65% (0.23). After ten-fold cross-validation, its predictive efficacy decreased (CV R² = 71.86%) and its prediction error increased (CV RMSE = 0.37). Nevertheless, the overall performance of the model was still better than that of the corresponding MLR model (CV R² = 60.48%; CV RMSE = 0.44). The Full-RFR model, which incorporated all predictor variables, performed better than the Conjoint-RFR model, with a CV R² (RMSE) of 72.20% (0.36). The variable importance results from the random forest algorithm (Figures 4A,B) indicated that the top five variables in the Conjoint-RFR model (Figure 4B), in order of importance, were type of place, outdoor PM_2.5 concentration, season, hour, and surface wind speed. A comparison of the importance rankings in the Conjoint-RFR model and the corresponding MLR model shows that the top three variables are the same in both models, namely outdoor PM_2.5 concentration, type of place, and season, but in a different order. By contrast, "hour" appears among the top five variables in the Conjoint-RFR model, whereas wind direction takes that place in the MLR model.

Figure 4. The importance of the predictor variables in random forest regression (RFR) models based on “%IncMSE.” Full-RFR model (A), Conjoint-RFR model (B). PM_2.5 and PM₁₀ refer to particulate matter with an aerodynamic diameter of 2.5 μm or less and of 10 μm.

4. Discussion

Significant differences in indoor PM_2.5 concentrations between various types of places and at different times of day were found in our study. The variable "type of place" ranked first and second in the importance assessment of the predictor variables of the RFR model and the MLR model, respectively. This result emphasizes the importance of place type in predicting indoor PM_2.5 concentration and suggests that a prediction model based on a single type of place may be difficult to extrapolate to other types of places. This is easy to understand: the different functional attributes of each place naturally create a unique indoor microenvironment, which in turn affects the generation, dispersion, deposition, and other behaviors of PM_2.5 (35–38). For example, an office has a high concentration of people, frequent use of office equipment (e.g., printers, photocopiers, and computers) and air-conditioning equipment (air-conditioners, humidifiers, air filters), low ventilation, and a single source of indoor pollution, whereas a shopping mall has a higher flow of people, more frequent ventilation, and a more complex internal environment. In contrast, the frequent cooking activities in restaurants generate smoke and high temperatures, creating a microenvironment different from those of the places mentioned above (35, 39). However, the currently available prediction models for indoor PM_2.5 concentrations are all based on monitoring data from a single type of place, such as residential buildings (16, 18, 40, 41), schools (19, 20), and offices (42), without considering the differences between various types of sites. This inevitably limits their application to the assessment of indoor PM_2.5 exposure.
At present, no study has attempted to construct an indoor PM_2.5 concentration prediction model based on monitoring data from multiple types of places, and our study has attempted to fill this gap. In addition, most existing studies have predicted indoor PM_2.5 concentration over a day or a longer period (such as a week); however, many published studies have shown that indoor PM_2.5 concentrations vary considerably within a day (19–21). Che et al. (43) continuously monitored indoor air quality in 32 primary and secondary schools across Hong Kong and found significant variations in classroom PM_2.5 concentrations at different times of the day; concentrations during school hours were approximately 40% higher than during non-school hours. Zhao et al. (44) reported that indoor PM_2.5 concentrations were 1.5 times higher at night than during the daytime in Beijing during winter. According to Xu et al. (13), indoor PM_2.5 concentrations at different moments of the day varied significantly, with the ratio of the highest to the lowest values even exceeding 15. This temporal variability of indoor PM_2.5 may originate from outdoor sources, for example, changes in outdoor PM_2.5 concentration, wind direction, temperature, and atmospheric pressure throughout the day and night (17, 44), or from indoor human activities, such as cooking, smoking, and use of air purifiers (45, 46). Whatever causes this variability, an indoor PM_2.5 concentration prediction model with a higher temporal resolution is more practical for refining individual PM_2.5 exposure assessment and health risk evaluation.

MLR models are widely used for indoor air quality prediction because of their simple methodology, easy application, and readily interpretable results (13, 17, 47). However, MLR application has prerequisites. First, a linear relationship must exist between each predictor variable and the response variable. Second, the response variable must follow a normal distribution for any given values of the predictor variables. Third, the response variable must have homogeneous variance across different values of the predictor variables. Fourth, the predictor variables must be independent of each other and not closely correlated. In practical applications, these prerequisites are sometimes not easily satisfied.

With improvements in computing power and the advent of the era of big data, machine learning algorithms have been constantly enhanced and have attracted wide attention. The random forest algorithm is an ensemble decision tree-based algorithm proposed by Breiman and Cutler in 2001, which can construct a large number of decision trees in parallel and achieve significantly higher computational efficiency than other machine learning methods by integrating the learning of multiple decision trees (27, 29). Because the random forest algorithm inherently includes interactions between variables, there is no need to consider the multicollinearity among variables that arises in general models, and the algorithm performs robustly with mixed data types, missing data, imbalanced data, and extreme data, leading to a high prediction accuracy of the model (28). In addition, owing to the inclusion of sample perturbation and attribute perturbation in the algorithm, the random forest model can effectively limit overfitting and is regarded as one of the best algorithms today (48–50). Random forest models also have certain drawbacks, such as poor interpretability; the model is usually considered a black box. Furthermore, categorical variables with more levels will have a greater impact on the model results than those with fewer levels, which may bias the prediction results (48, 51).

In our study, MLR and RFR prediction models were developed for hourly average indoor PM_2.5 concentrations based on monitoring data from multiple types of places. As a conventional and classical prediction model, the MLR model is widely used to predict indoor PM_2.5 concentration. Our MLR model (CV R² = 60.48%) had a relatively high predictive performance compared with published MLR prediction models of indoor PM_2.5 concentration based on 1 day or longer (such as 1 week), whose R² values ranged from 33 to 87% (13, 16, 18, 19, 52–54). To the best of our knowledge, only one study, by Xu et al. (13), has developed an MLR prediction model for hourly average indoor PM_2.5 concentration. In that study, two MLR models were developed for two regions, with CV R² values of 71 and 75%.

The two CV R² values in the study by Xu et al. (13) indicated better model predictive performance than that of our MLR model. This difference might be because the model development in our paper was based entirely on easily accessible temporal indicators and outdoor indicators. By contrast, the model construction in the study by Xu et al. (13) incorporated not only outdoor indicators (such as outdoor PM_2.5 concentration and outdoor relative humidity) but also indoor indicators (such as indoor smoking and cooking), with a wide range of indicator coverage. However, the model in that study also suffered from difficulties in the definition of relevant indicators, such as "whether or not to cook." In fact, cooking ingredients, cooking methods, cooking time, and the type of oil used have significant effects on indoor PM_2.5 concentration (55, 56). Moreover, these types of prediction indicators are not easy to obtain, and collecting them is costly. Only a few studies have developed RFR prediction models for indoor PM_2.5 concentration, and the CV R² values in these studies have ranged from 48.9 to 82% (13, 16, 18). The predictive efficacy of the Full-RFR model in this study (CV R² = 72.20%) was also at a high level.

MLR and RFR models, as common indoor PM_2.5 concentration prediction models, are still controversial in terms of which approach can better predict indoor PM_2.5 concentrations. Previous studies have shown (16, 44) that using the same dataset, an RFR model usually outperforms an MLR model in terms of predictive efficacy owing to the strength of the algorithm itself, such as robustness to missing data and good characterization of interactions between different predictor variables. However, some studies have reached the opposite conclusion, as in the study by Yuchi et al. (18). In their study, two models had the same variables for the same dataset, and the MLR model (CV R² = 50.2%) outperformed the RFR model (CV R² = 48.9%) in terms of generalization performance. This issue was also explored in the current study, as the results of our sensitivity analysis for the modeling algorithm showed that the Full-RFR model, which used all predictor variables, and the Conjoint-RFR model, which used the same predictor variables as MLR, both performed better than the MLR model.

Compared with other studies, the current study had several strengths. First, the models were built from indoor PM_2.5 monitoring data covering multiple types of places, which makes them more generalizable than models developed using data from a single type of place. Second, the models were built from indoor PM_2.5 concentration data with high temporal resolution (hourly averages), which takes the temporal variability of indoor PM_2.5 into account. Third, the sample size used for modeling (n = 11,712) greatly exceeded the number of predictor variables (11), so the models were less prone to overfitting. Fourth, the cost of prediction was low, because all predictor variables in the models are easy to obtain. For example, outdoor PM_2.5 concentration, wind direction, and surface wind can be found on the websites of the relevant government departments. The models are therefore suitable for epidemiological studies involving large populations and/or long time periods.

This study also had several limitations. First, the outdoor PM_2.5 concentration data for each indoor place were obtained from the nearest government-controlled monitoring site. Although this approach has been used in many previous studies, it could introduce some error into the models because outdoor PM_2.5 concentrations vary spatially. Second, the absence of variables describing human indoor activity, such as smoking and cooking, might increase the prediction error of the models in certain time periods and contexts, for instance, during cooking or when air purifiers were in use. Third, the models were developed and evaluated using data from Shanghai, and no equivalent data from other regions were available to further validate model performance.
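The first limitation concerns assigning each indoor place the readings of its nearest outdoor monitoring site. The sketch below shows one way such a match could be made, using great-circle (haversine) distance. It is an illustration under stated assumptions, not the procedure used in the study: the station names and all coordinates are hypothetical, and the section does not say how distance was computed.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(site, stations):
    """Return the name of the station closest to `site` (lat, lon)."""
    return min(stations, key=lambda name: haversine_km(*site, *stations[name]))

# Hypothetical coordinates for illustration only
stations = {
    "station_A": (31.23, 121.47),
    "station_B": (31.30, 121.60),
    "station_C": (31.10, 121.40),
}
site = (31.24, 121.50)  # hypothetical indoor monitoring place

name = nearest_station(site, stations)
dist = haversine_km(*site, *stations[name])
print(f"{name} is nearest, about {dist:.1f} km away")
```

Reporting the matched distance alongside the station makes the spatial mismatch visible: the farther the site, the less its reading may represent outdoor air at the indoor place.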

5. Conclusion

We found significant differences in indoor PM_2.5 concentration between types of places and between time periods. This finding points to the likely limitations of models based on indoor PM_2.5 concentration data from a single type of place, and to the need for a prediction model with high temporal resolution in order to refine individual PM_2.5 exposure assessment. We therefore developed MLR and RFR models for hourly average indoor PM_2.5 concentration across multiple types of places. Both models were based on easy-to-access indicators and showed good predictive efficacy, so they could be used for quantitative estimation of indoor PM_2.5 exposure in large-scale population studies. In addition, we compared the performance of the classical MLR model and the machine learning RFR model in predicting indoor PM_2.5 concentration, and the performance metrics showed that the RFR model outperformed the MLR model on the same dataset. This finding suggests the potential of RFR models for predicting indoor air pollutant levels, and other machine learning algorithms may also be worth exploring.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

YS, ZD, CD, and SS designed the research. YS, FH, FC, DW, ML, and HZ performed data acquisition and organization and part of the analysis. ZD and YS analyzed the data and wrote the primary manuscript. YS, ZD, CD, SS, FH, and FC contributed to the interpretation of the findings and critically reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the Shanghai Municipal Health Commission Science and Research Fund (No. 202040185).

Acknowledgments

The authors would like to thank all the personnel at the monitoring sites for their cooperation.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpubh.2023.1213453/full#supplementary-material

References

1. Cohen, AJ, Brauer, M, Burnett, R, Anderson, HR, Frostad, J, Estep, K, et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the global burden of diseases study 2015. Lancet Lond Engl. (2017) 389:1907–18. doi: 10.1016/S0140-6736(17)30505-6

2. Ritz, B, Hoffmann, B, and Peters, A. The effects of fine dust, ozone, and nitrogen dioxide on health. Dtsch Ärztebl Int. (2019) 51-52:881–6. doi: 10.3238/arztebl.2019.0881

3. Yang, L, Li, C, and Tang, X. The impact of PM2.5 on the host defense of respiratory system. Front Cell Dev Biol. (2020) 8:91. doi: 10.3389/fcell.2020.00091

4. Yang, H, Li, S, Sun, L, Zhang, X, Cao, Z, Xu, C, et al. Smog and risk of overall and type-specific cardiovascular diseases: a pooled analysis of 53 cohort studies with 21.09 million participants. Environ Res. (2019) 172:375–83. doi: 10.1016/j.envres.2019.01.040

5. Kaufman, JD, Adar, SD, Barr, RG, Budoff, M, Burke, GL, Curl, CL, et al. Association between air pollution and coronary artery calcification within six metropolitan areas in the USA (the multi-ethnic study of atherosclerosis and air pollution): a longitudinal cohort study. Lancet Lond Engl. (2016) 388:696–704. doi: 10.1016/S0140-6736(16)00378-0

6. Pinault, L, Tjepkema, M, Crouse, DL, Weichenthal, S, van Donkelaar, A, Martin, RV, et al. Risk estimates of mortality attributed to low concentrations of ambient fine particulate matter in the Canadian community health survey cohort. Environ Health Glob Access Sci Source. (2016) 15:18. doi: 10.1186/s12940-016-0111-6

7. Li, M-H, Fan, L-C, Mao, B, Yang, J-W, Choi, AMK, Cao, W-J, et al. Short-term exposure to ambient fine particulate matter increases hospitalizations and mortality in COPD: a systematic review and meta-analysis. Chest. (2016) 149:447–58. doi: 10.1378/chest.15-0513

8. Fan, J, Li, S, Fan, C, Bai, Z, and Yang, K. The impact of PM2.5 on asthma emergency department visits: a systematic review and meta-analysis. Environ Sci Pollut Res. (2016) 23:843–50. doi: 10.1007/s11356-015-5321-x

9. Hayes, RB, Chris, L, Yilong, Z, Kevin, C, Yongzhao, S, Reynolds, HR, et al. PM2.5 air pollution and cause-specific cardiovascular disease mortality. Int J Epidemiol. (2020) 49:25–35. doi: 10.1093/ije/dyz114

10. Gauvin, S, Reungoat, P, Cassadou, S, Déchenaux, J, Momas, I, Just, J, et al. Contribution of indoor and outdoor environments to PM2.5 personal exposure of children—VESTA study. Sci Total Environ. (2002) 297:175–81. doi: 10.1016/S0048-9697(02)00136-5

11. Rivas, I, Fussell, JC, Kelly, FJ, and Querol, X. Indoor sources of air pollutants. Issues Environ Sci Technol. (2019) 20:1–34. doi: 10.1039/9781788016179-00001
