Classification prediction model of indoor PM2.5 concentration using CatBoost algorithm

Guo, Zhenwei; Wang, Xinyu; Ge, Liang

doi:10.3389/fbuil.2023.1207193

ORIGINAL RESEARCH article

Front. Built Environ., 31 July 2023

Sec. Indoor Environment

Volume 9 - 2023 | https://doi.org/10.3389/fbuil.2023.1207193

This article is part of the Research Topic: Indoor Environmental Air Quality in Urban Areas

Zhenwei Guo¹,²

Xinyu Wang¹*

Liang Ge¹

¹Chinese Society for Urban Studies, Beijing, China
²National Engineering Research Center of Building Technology, Beijing, China

It is increasingly important to create a healthier indoor environment in office buildings. Accurate and reliable prediction of PM_2.5 concentration can effectively alleviate the response delay of indoor air quality control systems. The rapid development of machine learning has provided a research basis for using indoor air quality systems to control PM_2.5 concentration. One approach is to apply the CatBoost algorithm, which is based on ordered boosting, to the classification prediction of indoor PM_2.5 concentration. Using actual monitoring data from an office building, we take the previous indoor PM_2.5 concentration, indoor temperature, relative humidity, CO₂ concentration, and illuminance as input variables, with the output indicating whether the indoor PM_2.5 concentration exceeds 25 μg/m³. Based on the CatBoost algorithm, we construct an intelligent classification prediction model for indoor PM_2.5 concentration. The model is evaluated using actual data and compared with the multilayer perceptron (MLP), gradient boosting decision tree (GBDT), logistic regression (LR), decision tree (DT), and k-nearest neighbors (KNN) models. The CatBoost algorithm demonstrates the best predictive performance, achieving an area under the ROC curve (AUC) of 0.949 after hyperparameter optimization. Furthermore, the feature importance of the five input variables is ranked as follows: previous indoor PM_2.5 concentration, relative humidity, CO₂ concentration, indoor temperature, and illuminance. Verification shows that the CatBoost-based prediction model can accurately predict the indoor PM_2.5 concentration level. The model can be used to predict in advance whether the indoor PM_2.5 concentration will exceed the standard and to guide the regulation of the air quality control system.

1 Introduction

1.1 Motivation

People spend more than 80% of their time indoors, and with the prevalence of respiratory infectious diseases and increasingly severe outdoor pollution in recent years, this proportion will continue to rise. Office buildings are the main places where people work, and existing research shows that indoor air quality (IAQ) directly affects the physical and mental health of occupants. A high-quality indoor environment can promote work efficiency and health (Fisk, 2000; Newsham et al., 2008; Thayer Julian et al., 2010; Horr et al., 2016). During the COVID-19 pandemic, mask-wearing requirements in office buildings have been relatively lax compared with those in outdoor spaces, commercial buildings, and hotels. Therefore, IAQ in office buildings is extremely important.

PM_2.5 refers to fine particles with a diameter of ≤ 2.5 μm. These particles are composed of many chemical components, readily absorb harmful substances, and are the main pollutants affecting IAQ. Particulate pollution mainly harms the human respiratory system; it can cause tracheitis, pneumonia, and fever, and may also cause eye and nasal allergies or even death (Bell and Davis, 2001; Jacobs et al., 2018; Mo, 2019). As the largest developing country and one of the largest anthropogenic emitters of air pollution in the world, China currently faces serious air pollution problems, mainly from PM_2.5. The number of PM_2.5-related deaths in China in 2017 was 1.8 million, an increase of 30% over 2005 (Ming et al., 2021). The hazard of indoor PM_2.5 has gradually attracted attention, and it has been included in the Standard for Health Building Assessment as a control item (GB, 2002). Research has revealed that daily indoor PM_2.5 concentrations occasionally surpass 75 μg/m³, exceeding the guidelines set by the World Health Organization for IAQ (Zhao et al., 2015; Fan et al., 2018; XueWangLiu and Dong, 2020). To foster a healthy indoor environment, some office spaces have adopted IAQ improvement devices to ensure the provision of clean air. Fresh air systems are the main means by which HVAC controls indoor air pollution, and they can significantly reduce the concentration of PM_2.5 (ZhaoLiuRen, 2018; Huang et al., 2020). Although fresh air systems effectively lower indoor PM_2.5 concentrations to satisfactory levels, most of them do not adapt their operation to the IAQ conditions and often remain fully operational at all times (Lai et al., 2018). This lack of responsiveness is both unnecessary and energy-inefficient.

The application of artificial intelligence (AI) and machine learning (ML) has emerged as a promising solution to tackle the aforementioned challenges. With the gradual maturity of computer technology and automatic control theory, intelligent control technology based on ML has been widely used in the field of HVAC regulation. By combining intelligent predictive models with IAQ control devices, it becomes possible to accurately forecast PM_2.5 concentrations and translate them into control signals, thereby guiding the regulation process. This integration not only facilitates the creation of a healthy indoor environment but also minimizes unnecessary energy consumption within the systems.

1.2 The application of ML in IAQ prediction

Over the past decade, remarkable advancements have been made in applying AI/ML and Internet of Things (IoT) technologies to monitor and predict the physical environment of buildings. Specifically, ML prediction models for IAQ have been established based on historical data collected from sensors. Various methods have been employed to enhance the accuracy of IAQ prediction, including multivariate linear regression (MLR) (MengSpectorColome and Turpin, 2009; Martin and Šafránek, 2011; Maher Nor et al., 2015), artificial neural networks (ANN) (Sofuoglu, 2008; Xie et al., 2009; Skön et al., 2012), random forests (RF) (Kropat et al., 2015; Yuchi, 2017; Yuchi et al., 2019), partial least squares (PLS) (KimSankararaoKang et al., 2012; Lim et al., 2012; LeeKimKim and Yoo, 2015), decision trees (DT) (Kropat et al., 2015; Choi et al., 2017; Yuchi et al., 2019), among others. Additionally, less commonly used ML models have been explored for IAQ prediction. For instance, Justin et al. (2023) developed a Long Short-Term Memory prediction model using physical data observed by IAQ sensors, achieving an accuracy of approximately 60%–80% in determining real-time and near-term concentrations of indoor bioaerosols and PM, surpassing regression models with an accuracy of around 90%. YeganehMotlaghRashidi and Kamalan (2012) combined PLS with support vector machines (SVM) to predict the daily average value of CO, resulting in satisfactory outcomes. Carlos et al. (2018) used a kernel regression model to forecast CO₂ concentration, leveraging continuous data obtained through the Internet, which yielded favorable predictive results. These findings underscore the potential of ML in IAQ prediction.

1.3 The application of ML in PM_2.5 prediction

PM_2.5, being a significant pollutant influencing IAQ, has garnered considerable attention. Initially, mechanistic models were employed for PM_2.5 prediction. However, these models are inconvenient because they require detailed inputs, including indoor sources and sinks of PM_2.5, building envelope structures, ventilation conditions, and outdoor concentrations (Wei et al., 2019). When the prerequisites for constructing mechanistic models are unavailable, data-driven ML models have emerged as a favored approach for prediction. Feng et al. used ANN to predict PM_2.5 (XiaoQiZhu et al., 2015). Cheng et al. (2019) proposed a PM_2.5 prediction method for hospitals based on a multi-instance genetic neural network. Kim et al. (2009) used a recurrent neural network (RNN) to predict the daily indoor PM_2.5 concentration in subway stations, achieving a root mean square error (RMSE) of 17.8 μg/m³. The RNN model exhibited superior performance, with lower RMSE values and higher accuracy than other prediction models. Maher Nor et al. (2015) evaluated indoor PM_2.5 concentrations in naturally ventilated school buildings using MLR and feed-forward backpropagation (FFBP). The FFBP model outperformed the MLR model in determining indoor PM_2.5. Yuchi et al. (2019) applied the RF method to model indoor PM_2.5 concentrations in Mongolian apartments, demonstrating its superiority over the MLR model, but showing comparable performance in cross-validation. Xu et al. (2020) estimated indoor PM_2.5 concentrations in 66 apartments in China using the RF method, achieving an RMSE of approximately 20 μg/m³ in 10-fold cross-validation. Li et al. (2020) also employed the RF method and successfully estimated indoor PM_2.5 concentrations with a normalized RMSE of 15% in 10-fold cross-validation.

Although ML algorithms have been used to predict indoor PM_2.5 concentrations to some extent, there is limited research specifically focused on predicting PM_2.5 concentrations in office buildings (Wei et al., 2019; LagesseWangLarson and Kim, 2020). Moreover, these studies have primarily treated the PM_2.5 concentration as a continuous target variable. According to the Chinese standard “Assessment Standard for Healthy Building” (T/ASC 02-2021), it is desirable to maintain indoor PM_2.5 concentrations below 25 μg/m³ (T/ASC 02-2021, 2021). When the indoor PM_2.5 concentration exceeds this threshold, there is a potential health risk, and intervention is required to improve IAQ. Therefore, it is crucial to predict whether the indoor PM_2.5 concentration exceeds the standard. However, there is currently a lack of research on the classification prediction of PM_2.5 concentrations.

Considering the practical problem of classification prediction, the Boosting algorithm offers an effective solution. Boosting is an ensemble learning approach that converts weak learners into a strong learner through successive iterations and can solve supervised classification problems (Susnjak et al., 2012; Sun and Zhou, 2014). Boosting algorithms are currently widely used in photovoltaic power generation prediction (Imran, 2021; Liu et al., 2021; Yamamoto et al., 2022), business forecasting (Kiki and Vinasetan, 2020; Xie et al., 2021), and medicine and healthcare (Amy Isabella et al., 2022; Xue, 2022). They have also been applied to strength prediction of building materials (Lee et al., 2021; Zakir et al., 2022) and accident early warning (Zhou et al., 2021; Guo et al., 2022). CatBoost is one of the main algorithms of the Boosting family, offering strong robustness and versatility as well as broad platform applicability and fast prediction (Dorogush et al., 2018). CatBoost is an efficient ensemble learning method that improves on the gradient boosting decision tree (GBDT) by using ordered boosting and symmetric decision trees as weak classifiers (Huang et al., 2019). Through ordered boosting, the CatBoost algorithm builds a separate model for each sample, trained without that sample, which avoids the prediction bias caused by information leakage during training and improves prediction accuracy. Through the structure of symmetric decision trees, the CatBoost algorithm has fewer degrees of freedom, which reduces the probability of overfitting and significantly improves prediction speed.
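
To make the symmetric-tree idea concrete, the following minimal Python sketch evaluates a single symmetric (oblivious) tree, in which every node at a given depth shares the same split, so the leaf reached by a sample is simply the binary number formed by its split outcomes. The split features, thresholds, and leaf values are hypothetical and chosen only for illustration; this is not the CatBoost implementation.

```python
def symmetric_tree_predict(x, level_splits, leaf_values):
    """Evaluate one symmetric (oblivious) decision tree.

    x            -- feature vector (list of floats)
    level_splits -- one (feature_index, threshold) pair per tree level
    leaf_values  -- 2 ** len(level_splits) leaf outputs
    """
    leaf_index = 0
    for feature_index, threshold in level_splits:
        # Every node on this level applies the same test, so each level
        # contributes exactly one bit to the leaf index.
        bit = 1 if x[feature_index] > threshold else 0
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree: level 1 tests previous PM2.5, level 2 tests humidity.
splits = [(0, 25.0), (1, 40.0)]
leaves = [-1.0, -0.5, 0.5, 1.0]
print(symmetric_tree_predict([30.0, 35.0], splits, leaves))  # leaf index 0b10 -> 0.5
```

Because a tree of depth d is fully described by d splits and 2^d leaf values, it has far fewer free parameters than an unconstrained tree of the same depth, which is the source of the reduced overfitting risk and fast evaluation described above.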

1.4 Contribution

Predicting whether the PM_2.5 concentration will exceed the standard at the next time step is beneficial for regulating indoor air quality. However, there is currently a scarcity of research on using ML algorithms to forecast and classify PM_2.5 concentration. This study aims to develop an intelligent prediction model for PM_2.5 concentration using the CatBoost algorithm. The model is built and its accuracy and effectiveness are validated using real monitoring data from an office building. Additionally, the study assesses the efficiency and superiority of the CatBoost algorithm by comparing it with other commonly employed algorithms.

2 Methodology

2.1 Data acquisition and processing

The data used in this paper come from an indoor air quality monitoring platform in an office in Beijing, China. The monitored quantities are indoor temperature, relative humidity, CO₂ concentration, illuminance, and PM_2.5 concentration. The environmental data are continuously transmitted from the sensors to the monitoring platform over a wireless network. The recording interval is 5 min, and the data for each measured quantity are stored in a separate database. The testing range and measurement accuracy of the monitoring instruments are shown in Table 1. We collected measurement data between January 18, 2022, and March 29, 2022.

TABLE 1. Test scope and accuracy of indoor environmental testing indicators.

In this study, the model uses previously sampled monitoring data as input to predict the class of the PM_2.5 concentration at the current time. Existing research shows that, when ML is used to predict PM_2.5, indoor temperature (Kim et al., 2009; Das et al., 2014; Elbayoumi et al., 2014; Elbayoumi et al., 2015; LiuYoo, 2016; Deepti and Suresh, 2019; DaiLiuLi, 2021), indoor relative humidity (Kim et al., 2009; Elbayoumi et al., 2015; LiuYoo, 2016; Deepti and Suresh, 2019; DaiLiuLi, 2021), CO₂ concentration (Kim et al., 2009; Elbayoumi et al., 2015; LiuYoo, 2016; Deepti and Suresh, 2019; DaiLiuLi, 2021), and previous PM_2.5 concentration (Kim et al., 2009; Lim et al., 2012; Jorge et al., 2018; Hyun et al., 2018; Deepti and Suresh, 2019; DaiLiuLi, 2021) are frequently used as input variables, which suggests that these parameters are related to indoor PM_2.5. In addition, Ahn et al. (2017) included the influence of light when using deep learning methods to predict IAQ, so the concentration of particulate matter may also be related to light. Therefore, we use the previous indoor temperature (t_h-1), relative humidity (d_h-1), CO₂ concentration (C_h-1), illuminance (L_h-1), and PM_2.5 concentration (P_h-1) as input variables for the models.

The indoor PM_2.5 concentration is chaotic and time-varying and is strongly correlated with human activities, so accurately predicting its exact dynamic value is unrealistic. Real-time and accurate prediction of the PM_2.5 concentration range, by contrast, is an important part of creating a healthy indoor environment. Therefore, this paper chooses the PM_2.5 concentration range as the output of the classification model. The Chinese standard “Assessment Standard for Healthy Building” (T/ASC 02-2021) sets limits on indoor PM_2.5 concentration. The “Air” section of the control items stipulates that the annual average concentration of indoor PM_2.5 should not be higher than 25 μg/m³. The scoring items propose that the annual average concentration should not be higher than 15 μg/m³ and the daily average concentration should not be higher than 35 μg/m³, and the “Improvement and Innovation” section of the bonus items proposes that the daily average concentration should not be higher than 25 μg/m³ (T/ASC 02-2021, 2021). Based on the above, this paper classifies the PM_2.5 concentration as follows: when PM_2.5 ≤ 25 μg/m³, indoor PM_2.5 pollution is considered relatively low and the current state is maintained; when PM_2.5 > 25 μg/m³, indoor PM_2.5 pollution is considered significant and purification is required, as shown in Table 2.

TABLE 2. Determination of indoor PM2.5 concentration.
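
The input-output relationship described above (measurements at step h-1 as inputs, the class of PM_2.5 at step h as output) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the dictionary keys and the sample readings are hypothetical, and only the 25 μg/m³ threshold comes from the text.

```python
THRESHOLD = 25.0  # ug/m3, the limit used in the text


def build_samples(records, threshold=THRESHOLD):
    """Pair the measurements at step h-1 with the PM2.5 class at step h.

    records -- time-ordered list of dicts with the hypothetical keys
               'temp', 'rh', 'co2', 'lux', 'pm25'
    Returns (features, labels); label 1 means PM2.5 > threshold.
    """
    features, labels = [], []
    for prev, curr in zip(records, records[1:]):
        features.append(
            [prev["temp"], prev["rh"], prev["co2"], prev["lux"], prev["pm25"]]
        )
        labels.append(1 if curr["pm25"] > threshold else 0)
    return features, labels


# Made-up readings at 5 min intervals, for illustration only.
records = [
    {"temp": 22.1, "rh": 35.0, "co2": 620.0, "lux": 310.0, "pm25": 18.0},
    {"temp": 22.3, "rh": 34.0, "co2": 650.0, "lux": 305.0, "pm25": 27.5},
    {"temp": 22.4, "rh": 34.5, "co2": 640.0, "lux": 300.0, "pm25": 25.0},
]
X, y = build_samples(records)
print(y)  # [1, 0]
```

Note that a reading of exactly 25 μg/m³ falls in the lower class, matching the "≤ 25" and "> 25" boundaries given in the text.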

2.2 Data preprocessing

Owing to power supply, signal transmission, network, and other factors, monitoring equipment produces data quality problems such as missing values and outliers. To avoid interference with the model, invalid data need to be cleaned.

The data quality problems in the dataset used in this paper are missing values and outliers, which are handled as follows: 1) Variables with a missing rate of more than 30% are regarded as invalid variables, and those with a missing rate of less than 5% are forward-filled. For variables with a missing rate of 5%–30%, random forest multiple imputation by chained equations (RFMICE) is used to fill in the missing values. 2) Abnormally large and small outliers are identified with the 3-sigma criterion and forward-filled; outliers that repeat across 12 consecutive samples are regarded as data collection abnormalities and are removed directly. To minimize the impact of the outdoor environment and abnormal use on the model, the indoor temperature and CO₂ concentration variables are limited to a reasonable range. With reference to the Chinese standard “Design Code for Heating Ventilation and Air Conditioning of Civil Buildings” (GB 50736-2012), the indoor temperature is restricted to values of at least 16°C (GB, 2012). With reference to the Chinese standard “Hygienic Standard for Carbon Dioxide in Indoor Air” (GB/T 17094-1997), indoor CO₂ is restricted to the range 0 to 2000 mg/m³ (GB, 2021).
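
As a rough illustration of the 3-sigma criterion combined with forward filling, the sketch below cleans a single variable using only the Python standard library. It is a simplified sketch, not the authors' pipeline: it covers neither the RFMICE imputation nor the rule for 12 consecutive repeated outliers, the function name is hypothetical, and the readings are made up.

```python
from statistics import mean, pstdev


def clean_series(values, n_sigma=3.0):
    """Replace missing values (None) and n-sigma outliers with the last valid value."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    cleaned, last_valid = [], None
    for v in values:
        is_outlier = v is not None and sigma > 0 and abs(v - mu) > n_sigma * sigma
        if v is None or is_outlier:
            # Forward fill; stays None if no valid value has been seen yet.
            cleaned.append(last_valid)
        else:
            cleaned.append(v)
            last_valid = v
    return cleaned


# Made-up PM2.5 readings with one spike (500.0) and one missing value (None).
readings = [20.0, 21.0, 19.0, 20.0, 22.0, 21.0, 20.0, 19.0, 21.0, 20.0,
            500.0, 20.0, 21.0, 19.0, 20.0, 22.0, 21.0, 20.0, 19.0, None]
cleaned = clean_series(readings)
print(cleaned[10], cleaned[19])  # 20.0 19.0
```

One caveat of this simple variant is that the mean and standard deviation are themselves inflated by the outliers, so very short series or clusters of spikes may escape detection.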

3 Prediction model

This article establishes a PM_2.5 concentration range classification prediction model based on the CatBoost model, as shown in Figure 1. The specific process is as follows:

(1) Data acquisition and preprocessing. Data is obtained from the platform and preprocessed.

(2) Data Splitting. The preprocessed data is divided into training set, validation set, and test set in the proportions of 70%, 15%, and 15% respectively.

(3) Hyperparameter optimization (HPO). The hyperparameters to tune were selected by evaluating their impact on the model’s performance. In addition, based on our analysis of previous literature and our own experience (Zhao et al., 2021; Peng et al., 2022), we observed satisfactory performance of the CatBoost model after optimizing the hyperparameters learning_rate, depth, min_data_in_leaf, bagging_temperature, and reg_lambda. To run and visualize the HPO process, we used the Optuna package (version 2.10.0) (Akiba et al., 2019), an open-source optimization framework. This framework makes it straightforward to run complex machine learning experiments and to perform HPO using the Hyperband method (Li et al., 2016). With Optuna, various combinations of hyperparameters can be tested dynamically, allowing an effective and systematic exploration of the hyperparameter space. This aids in finding the optimal configuration for the models and improves their performance and accuracy.

(4) The best hyperparameter combination is used to predict the PM_2.5 concentration range, and the performance of the model is further demonstrated through time series cross validation.

FIGURE 1. PM_2.5 concentration range classification prediction model.
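
The 70%/15%/15% split in step (2) can be sketched as follows. The text does not state whether the split is random or chronological; the sketch assumes a chronological split, which is the usual choice for time series because it keeps future samples out of the training set. The function name is hypothetical.

```python
def chronological_split(samples, train_pct=70, val_pct=15):
    """Split time-ordered samples into train/validation/test without shuffling."""
    n = len(samples)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


# 14570 is the number of samples reported after preprocessing.
train_set, val_set, test_set = chronological_split(list(range(14570)))
print(len(train_set), len(val_set), len(test_set))  # 10199 2185 2186
```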

4 Result

4.1 Data statistics

After processing the dataset, there were 14570 remaining samples for model analysis, as shown in Table 3. The distribution of data for each variable is shown in Figure 2. In the dataset, there are 5985 samples with p > 25 μg/m³, and 8585 samples with p ≤ 25 μg/m³.

TABLE 3. Data description for the model.

FIGURE 2. Data distribution of various variables in the training dataset.

4.2 Hyperparameter optimization

The model underwent HPO to improve its predictive performance. The CatBoost model with the maximum AUC was obtained using Optuna. The search domains and selected values for the hyperparameters of the CatBoost model and of the competing models are shown in Table 4. As shown in Figure 3, HPO improved the performance of the CatBoost model, whose AUC after HPO is 0.949.

TABLE 4. Search domain and optimal combination of the main hyperparameters.

FIGURE 3. Optimization history of HPO.
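
Hyperband, the HPO method named in Section 3, builds on successive halving: many configurations are evaluated on a small budget, only the best fraction is kept, and the survivors are re-evaluated on a larger budget. The sketch below shows one successive-halving bracket in plain Python. It does not use Optuna or CatBoost; the scoring function is a hypothetical stand-in for a validation AUC, and the candidate learning rates are illustrative rather than the search domain of Table 4.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Repeatedly keep the best 1/eta of the configurations while
    multiplying the per-configuration budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_score(config, budget):
    # Hypothetical stand-in for a validation AUC: best near learning_rate = 0.1,
    # with a penalty that shrinks as the training budget grows.
    return 1.0 - abs(config["learning_rate"] - 0.1) - 0.1 / budget


candidates = [{"learning_rate": lr}
              for lr in (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5)]
best = successive_halving(candidates, toy_score)
print(best)  # {'learning_rate': 0.1}
```

Hyperband runs several such brackets with different trade-offs between the number of configurations and the starting budget, which hedges against discarding configurations that only perform well after longer training.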

4.3 Cross validation

To further validate the robustness and generalizability of the CatBoost model, we performed time series split cross-validation. We employed the rolling window approach to partition the time series data into several combinations of training and validation sets. This method splits the data sequentially in chronological order, so that the training set comprises past data and the validation set contains future data. This approach simulates the model’s performance on future data more effectively. The results of the cross-validation are shown in Figure 4.

FIGURE 4. Cross-validation.
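
A rolling-window split of the kind described above can be sketched with a small generator that yields index ranges. The window sizes and the step are hypothetical, since the text does not report them; the only property taken from the text is that every training window precedes its validation window in time.

```python
def rolling_window_splits(n_samples, train_size, val_size, step=None):
    """Yield (train_indices, val_indices) pairs in chronological order.

    The window slides forward by `step` samples (default: val_size), and
    the validation block always follows the training block directly.
    """
    step = step or val_size
    start = 0
    while start + train_size + val_size <= n_samples:
        train = range(start, start + train_size)
        val = range(start + train_size, start + train_size + val_size)
        yield train, val
        start += step


# Illustrative sizes only: 10 samples, 4 for training, 2 for validation.
folds = list(rolling_window_splits(10, train_size=4, val_size=2))
for train, val in folds:
    print(list(train), list(val))
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```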

4.4 Model comparison

To further validate its performance, the CatBoost model was compared with five other commonly used classification algorithms: the multilayer perceptron (MLP), GBDT, logistic regression (LR), DT, and k-nearest neighbors (KNN) models. The prediction results are shown in Figure 5. Among the models evaluated, the CatBoost algorithm demonstrates the highest predictive performance, achieving an AUC value of 0.949. The MLP, GBDT, DT, and KNN models show similar AUC values of 0.917, 0.938, 0.927, and 0.926, respectively. In comparison, the LR model exhibits a lower AUC value of 0.888. The precision-recall (P-R) curve illustrates the trade-off between precision and recall at various classification thresholds; plotting it shows how the model’s precision and recall change as the threshold varies. The P-R curves of the models are shown in Figure 6. The F1 score combines precision and recall into a single balanced measure of a model’s performance and is particularly useful when precision and recall are considered equally important. The F1 scores of the models are shown in Table 5.

FIGURE 5. ROC curves of 6 different models.

FIGURE 6. P-R curves of 6 different models.

TABLE 5. Performance of models.
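
For readers less familiar with the two metrics used in this comparison, the following self-contained Python sketch computes the AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count one half), and the F1 score as the harmonic mean of precision and recall. The labels and scores are made up for illustration and are unrelated to the results in Table 5.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: label 1 means PM2.5 > 25 ug/m3.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # classify at a 0.5 threshold

print(roc_auc(y_true, scores))            # 0.75
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```

The AUC depends only on the ranking of the scores, whereas the F1 score depends on the chosen classification threshold, which is why the two metrics are reported side by side.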

4.5 Importance analysis

The importance of the features included in the model was analyzed using SHAP (SHapley Additive exPlanations) (Shapley, 1953). Figure 7 shows the SHAP value ranking of each feature and the specific impact of each feature on the output variable. Features are ordered from top to bottom by decreasing importance for the output. For identifying whether the indoor PM_2.5 concentration exceeds the standard, P_h-1 is the most important feature, followed by d_h-1, C_h-1, and t_h-1; L_h-1 has the least importance. The color scale from blue to red represents the feature value (red high, blue low). The x-axis measures the impact on the model output (right positive, left negative). The figure shows that the greater P_h-1 is, the higher the risk that the indoor PM_2.5 concentration exceeds the standard; the lower d_h-1 is, the greater the risk, and vice versa; the higher C_h-1 is, the greater the risk, and vice versa; and the lower t_h-1 is, the greater the risk, and vice versa.

FIGURE 7. Summary of model SHAP features.
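
SHAP values are based on the Shapley value from cooperative game theory: the contribution of a feature is its average marginal effect on the prediction over all orders in which features can be added. The brute-force Python sketch below computes exact Shapley values for a tiny model by replacing "absent" features with a baseline value. It is meant only to illustrate the definition; the toy linear model, its weights, and the baseline are hypothetical, and practical SHAP implementations for tree ensembles use much faster specialized algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(present):
        # Features outside `present` are replaced by their baseline value.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {i}) - value(present))
        phi.append(total)
    return phi


def toy_model(v):
    # Hypothetical linear score with three features; not the fitted CatBoost model.
    return 0.8 * v[0] - 0.3 * v[1] + 0.1 * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])  # [0.8, -0.6, 0.3]
```

The values sum to the difference between the prediction for x and the prediction for the baseline, which is the additivity property that lets a summary plot be read as a decomposition of each prediction.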

5 Discussion

This paper presents the development of a predictive model based on ML algorithms that accurately predicts indoor PM_2.5 concentration levels. The CatBoost model demonstrates significantly better predictive performance compared to the MLP, GBDT, LR, DT, and KNN models. Moreover, this study identified P_h-1 and d_h-1 as the two most important predictive factors.

Emerging technologies such as the IoT, AI, and ML have shown tremendous potential in monitoring indoor environmental quality and facilitating timely intervention (Adeleke et al., 2017). The remarkable predictive capabilities of ML algorithms make them highly attractive when combined with on-site data monitoring systems to effectively determine the constantly changing levels of indoor pollutants (AdityaSharma and Gupta, 2018; Saini et al., 2022). Previous studies conducted by Elbayoumi and Yuchi have confirmed the value of ML methods in predicting PM_2.5 concentrations (Elbayoumi et al., 2015; Yuchi et al., 2019). These findings align with our research, further validating the efficiency and superiority of ML models. By comparing CatBoost with other commonly used models, we highlight the superior predictive performance of CatBoost, providing valuable insights into its application for indoor air quality prediction.

Given that global air pollution severely exceeds health thresholds, the impact of air quality on human health has garnered significant attention (Massey et al., 2012). As mentioned in the introduction, PM_2.5 poses a serious threat to human health, particularly because harmful microorganisms can attach to particulate matter. The importance of maintaining good indoor air quality in office buildings cannot be overstated, as it directly affects the health and wellbeing of occupants. Predictability plays a crucial role in controlling PM_2.5 concentrations because indoor air quality improvement systems often exhibit inherent delays, yet accurately predicting PM_2.5 concentrations poses a considerable challenge. This study constructs a PM_2.5 concentration classification prediction model using ML algorithms, which accurately predicts whether PM_2.5 concentrations exceed the standard and serves as a risk warning model. The model’s predictions can be used as control signals in the operation and regulation of indoor air quality improvement equipment, such as fresh air systems. By forecasting in advance whether indoor PM_2.5 concentrations will exceed the standard, the model helps determine whether the status quo can be maintained or purification measures are required, improving indoor environmental conditions while minimizing unnecessary energy consumption.

Through variable importance ranking analysis, we determined that the previous PM_2.5 concentration is the most significant factor in predicting whether the PM_2.5 concentration exceeds the standard, as expected. Furthermore, we observed a negative correlation between indoor humidity and the risk of the PM_2.5 concentration exceeding the standard. Humidity plays a vital role in the nucleation, condensation, and volatilization of particles, thereby influencing their diffusion and altering the concentration of PM_2.5 (Chithra and Nagendra, 2014). Yang et al. monitored indoor PM concentrations in a primary school classroom in North China and assessed the contributions of various influencing factors; their findings highlighted the critical role of indoor humidity in managing indoor PM_2.5 concentration (Guangfei and YuheBing, 2023). The results of this article align with these studies, providing further evidence of the close relationship between indoor humidity and the likelihood of the PM_2.5 concentration exceeding the standard. In practical applications, this research provides a valuable reference for real-time assessment and management of indoor air quality. This study shows that lower humidity increases the risk of the indoor PM_2.5 concentration surpassing the standard in the next time interval, emphasizing the significance of humidity control in improving air quality.

6 Conclusion

In comparison to the MLP, GBDT, LR, DT, and KNN models, the CatBoost model demonstrates notable advantages in predicting whether the indoor PM_2.5 concentration exceeds the standard. Through HPO, the model’s predictive performance can be further enhanced. Additionally, this study identifies the previous PM_2.5 concentration and relative humidity as the two most influential factors for prediction.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions

XYW was mainly responsible for drafting the article and for data analysis. ZWG was mainly responsible for reviewing the article and for project management. LG was mainly responsible for data collection and for polishing the article. All authors contributed to the article and approved the submitted version.

Funding

This study was supported by the Opening Funds of State Key Laboratory of Building Safety and Built Environment and National Engineering Research Center of Building Technology and the key special project of the National Key R&D Plan “Intergovernmental International Science and Technology Innovation Cooperation” titled “Research on Improving Energy Efficiency and Health Performance of Building Operation Based on Full Life Cycle Carbon Reduction” (2018YFE0106100).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Adeleke, J. A., Moodley, D., Rens, G., and Adewumi, A. (2017). Integrating statistical machine learning in a semantic sensor web for proactive monitoring and control. Sensors 17 (4), 807. doi:10.3390/s17040807

Aditya, Sharma, M., and Gupta, S. C. (2018). “An Internet of Things based smart surveillance and monitoring system using Arduino,” in 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE) (IEEE).

Ahn, J., Shin, D., Kim, K., and Yang, J. (2017). Indoor air quality analysis using deep learning with sensor data. Sensors 17 (11), 2476. doi:10.3390/s17112476

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623–2631.

Amy Isabella, S., Javier, R., and Peter, J. G. (2022). Integrating multiple brain imaging modalities does not boost prediction of subclinical atherosclerosis in midlife adults. NeuroImage Clin. 35, 103134. doi:10.1016/j.nicl.2022.103134

Bell, M. L., and Davis, D. L. (2001). Reassessment of the lethal London fog of 1952: Novel indicators of acute and chronic consequences of acute exposure to air pollution. J. Environ. Health Perspect. Suppl. 109, 389. doi:10.2307/3434786

Carlos, G., Valeria, F., and Guillermo, V. (2018). Use of non-industrial environmental sensors and machine learning techniques in telemetry for indoor air pollution. ARPN J. Eng. Appl. Sci. 13, 2702-2712.

Cheng, C., Wu, H., and Liu, W. (2019). Indoor PM2.5 prediction based on multi-instance genetic neural network. Comput. Appl. Softw. 36 (5), 7. (in Chinese). doi:10.3969/j.issn.1000-386x.2019.05.041