ORIGINAL RESEARCH article

Front. Environ. Sci., 17 March 2025

Sec. Big Data, AI, and the Environment

Volume 13 - 2025 | https://doi.org/10.3389/fenvs.2025.1549209

This article is part of the Research Topic: Dust and Polluted Aerosols: Sources, Transport and Radiative Effects Volume II.

SFDformer: a frequency-based sparse decomposition transformer for air pollution time series prediction

Zhenkai Qin1,2, Baozhong Wei1, Caifeng Gao1, Xiaolong Chen3*, Hongfeng Zhang3*, Cora Un In Wong3
  • 1School of Information Technology, Guangxi Police College, Nanning, China
  • 2School of Computer Science and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • 3Faculty of Humanities and Social Sciences, Macao Polytechnic University, Macao, China

Introduction: With the rapid advancement of industrialization and the prevalent occurrence of haze weather, PM2.5 contamination has emerged as a significant threat to public health and environmental sustainability. The concentration of PM2.5 exhibits intricate dynamic attributes and is profoundly correlated with meteorological conditions as well as the concentrations of other pollutants, thereby substantially increasing the complexity of prediction.

Methods: A novel predictive methodology has been developed that integrates time-series frequency-domain analysis and decomposition with deep learning. This approach captures interdependencies among high-dimensional features through time-series decomposition, employs the Fourier Transform to mitigate noise interference, and incorporates sparse attention mechanisms to selectively filter critical frequency components, thereby enhancing time-dependent modeling. Importantly, this technique reduces computational complexity from O(L^2) to O(L log L).

Results: Empirical findings substantiate that this methodology yields notably superior predictive accuracy relative to conventional models across a diverse array of real-world datasets.

Discussion: This advancement not only offers an efficacious resolution for PM2.5 prediction tasks but also paves the way for innovative research and application prospects in the realm of complex time series modeling.

1 Introduction

Over the past several decades, the rapid pace of industrialization has precipitated the frequent occurrence of smog, thereby intensifying environmental pollution. Fine particulate matter (PM2.5), characterized by particles with a diameter of 2.5 μm or less, has emerged as a pivotal pollutant that poses considerable risks to human health. As indicated by the World Health Organization (WHO), nearly 90% of the global populace inhales air that surpasses its quality standards, rendering PM2.5 a primary contributor to respiratory ailments (Ailshire and Crimmins, 2014; Pöschl, 2005). Additionally, short-term exposure to PM2.5 (spanning from hours to weeks) has been associated with cardiovascular-related mortality and other health sequelae (Du et al., 2016). Beyond its ramifications for public health, deteriorating air quality imposes substantial economic burdens. A report by the Organization for Economic Cooperation and Development (OECD) (Lanzi, 2016) underscores that air pollution could lead to global GDP losses of up to 1%.

Developing an efficient air pollution monitoring and prediction system is, consequently, imperative for safeguarding human health and alleviating economic losses. Nonetheless, the formation mechanism of PM2.5 is exceptionally intricate (Lu et al., 2021), encompassing complex interactions among various external pollutants, which markedly complicates the prediction process. Furthermore, air quality data exhibit a strong temporal dependence, constituting a prototypical time-series dataset with distinct periodic features. Predicting PM2.5 concentrations constitutes a formidable task, necessitating the incorporation of meteorological factors (e.g., precipitation and temperature) and historical data (e.g., PM10, SO2) into time-series modeling. Extensive research has shown that these factors are highly correlated and have complex relationships in the formation of air pollution (Rakholia et al., 2024; He et al., 2017; Luo et al., 2020; Neiburger, 1969). Consequently, effectively discerning these complex interactions and integrating them into pollutant prediction models has emerged as a pivotal aspect in comprehending pollution mechanisms and improving prediction accuracy (Deng et al., 2024). To tackle the dynamic variations in pollutant concentrations and their intricate feature relationships, a plethora of modeling approaches have been suggested. Conventional statistical methods were extensively utilized in the initial phases of air quality prediction research. These methods predominantly depend on historical data for model training, employing frequently used techniques such as Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models (Liu and Yang, 2021). However, as the volume and complexity of data have escalated, these methods have encountered difficulties in meeting the practical demands of real-time forecasting of pollutant concentrations due to prolonged training times and limited scalability.

The advent of deep learning technologies has led to the emergence of Transformer-based deep learning models as innovative solutions for tackling complex problems and enhancing performance. These models are particularly efficacious as they account for the temporal correlations inherent in pollutant concentration sequences. To date, deep learning models have demonstrated state-of-the-art capabilities in time-series prediction tasks. By capitalizing on the neural networks’ ability to extract temporal features from time-series data, the precision of pollutant concentration predictions can be substantially improved. Empirical studies on air pollutant prediction have shown that deep learning models surpass traditional methods, including classical machine learning algorithms, by more effectively capturing high-dimensional feature dependencies and temporal patterns (Panneerselvam and Thiagarajan, 2024). Nevertheless, conventional Transformer models encounter several challenges, particularly their substantial computational cost, which is especially significant when dealing with large-scale environmental datasets. The temporal continuity, dynamic fluctuations, and complex intercorrelations within pollutant concentration time-series data further complicate accurate prediction and analysis. Moreover, challenges such as noise, nonlinearity, and high-dimensional complexity inherent in environmental big data pose considerable obstacles for extracting temporal correlation information between pollutant concentrations and meteorological factors (Chen et al., 2024).

To tackle these challenges, this study introduces an end-to-end framework named Sparse Frequency Decomposition Transformer (SFDformer) for predicting the time series of pollutant concentrations. Figure 1 illustrates an overview of the proposed method. This approach uses time-series decomposition to capture the interdependencies among high-dimensional features and employs Fourier Transform to convert the data into the frequency domain, effectively reducing noise interference. The SFDformer integrates a sparse attention mechanism that selectively allocates weights to key frequency components, reducing the computational complexity from quadratic to linear time complexity. This design enhances computational efficiency while accurately extracting crucial features, providing a more accurate and efficient solution for forecasting pollutant concentrations. In summary, the main contributions of this paper are as follows.

• By fully considering the temporal dependencies in the time domain and the characteristic information in the frequency domain, a dual-domain modeling approach is used to accurately extract the complex correlation features between pollutant concentrations and meteorological data.

• We have introduced a frequency sparse attention mechanism based on Fourier transform, which combines sparse attention with Fourier transform to reduce the computational cost of self-attention layers and the impact of noise during the prediction process.

• For the problem of air pollution prediction, extensive experiments on eight real-world datasets have demonstrated the practicality and feasibility of the proposed model in PM2.5 concentration forecasting. Furthermore, the results obtained in this work outperform other deep learning models reported in the literature.


Figure 1. Schematic overview of the proposed SFDformer method. Initially, the Sparse Frequency Decomposition Attention module (Frequency-Sparse Attention, blue block) is designed to perform frequency transformation and reduce model parameters by leveraging Fourier transform and sparse attention mechanisms. More specifically, the Fourier transform converts time-domain data into the frequency domain to reduce the impact of noise, while sparse attention is employed to filter the critical frequency weight matrices. After that, the time series pooling decomposition (TSP Decomp, yellow block) method is utilized to extract seasonal and trend patterns from the input time series data.

2 Related work

The prediction of air pollutant concentrations is currently accomplished through two primary methodologies: physicochemical approaches and data-driven approaches. Physicochemical approaches entail the simulation and analysis of the physical and chemical processes that regulate air pollutants, employing fundamental physical and chemical principles to forecast pollutant behavior across diverse spatial and temporal scales (Thongthammachart et al., 2021; Kang et al., 2018; Hofman et al., 2022). Although these approaches can yield high prediction accuracy, they typically necessitate intricate model configurations and extensive parameter tuning, which may result in limited model generalization and diminished robustness in practical applications (Wang et al., 2020).

The emergence of meteorological stations and analogous monitoring devices, air quality monitoring stations, and meteorological satellites has enabled the gathering of data on air pollutant concentrations and meteorological conditions. This data provides strong support for research on air quality prediction (Gu et al., 2021; Kim et al., 2022). Data-driven methodologies have been increasingly employed to forecast air pollutant concentrations. In the nascent stages of air pollution prediction research, conventional statistical models such as ARIMA and SARIMA were extensively utilized. These models forecast pollutant concentrations by examining the historical trends and periodic characteristics of time series data (Marvin et al., 2022). While these methods excel in modeling stationary time series and capturing short-term dependencies, they exhibit notable limitations when addressing complex nonlinear relationships and long-term sequence dependencies (Zhou et al., 2018). Specifically, the omission of high-frequency information in these traditional models results in the loss of critical data, thereby constraining prediction accuracy and applicability. Furthermore, these methods encounter difficulties in leveraging multidimensional data (such as meteorological features and concentrations of other pollutants) to delineate more comprehensive pollutant characteristics (Tagliabue et al., 2021). With advancements in data scale and computational power, machine learning methodologies have progressively emerged as more versatile options. Models such as Support Vector Regression (SVR), Random Forest (RF), and Multi-Layer Perceptron (MLP) have gained widespread adoption due to their efficacy in managing nonlinear relationships (Haq and Ahmad Khan, 2022; Rybarczyk and Zalakeviciute, 2018). These methodologies demonstrate superior predictive performance compared to traditional statistical methods by utilizing multidimensional data for modeling (Ma X. et al., 2023; Pan et al., 2023). However, they depend on manually crafted feature engineering, and their capacity to model the interdependencies of other multidimensional data influencing air pollutant concentrations remains limited (Zaini et al., 2022). Nonetheless, these methodologies have furnished valuable insights into air pollution prediction and established a foundation for investigating hybrid models that integrate traditional methods with deep learning technologies (Kshirsagar and Shah, 2022; Méndez et al., 2023).

The rapid advancement of deep learning technologies has led to significant breakthroughs in their application to time series forecasting, especially in the realm of air pollution prediction. In comparison to traditional statistical methods and classical machine learning techniques, deep learning models exhibit considerable advantages due to their robust ability to model non-linearity and precisely capture temporal dependencies. Recurrent Neural Networks (RNNs) and their sophisticated variants, such as Long Short-Term Memory Networks (LSTMs) (Han et al., 2023) and Gated Recurrent Units (GRUs), have been extensively utilized to process time series data (Espinosa et al., 2021). These models adeptly capture long-term dependencies through memory units, effectively mitigating the challenges of vanishing and exploding gradients (Athira et al., 2018; Faraji et al., 2022). Nonetheless, individual models still possess certain limitations in modeling high-dimensional features (Sarkar et al., 2022). To further enhance the performance of air pollution time series forecasting, researchers have devised hybrid architectures, such as LSTM-CNN (Ghimire et al., 2019), LSTM-RNN (Ozcanli et al., 2020), and CNN-LSTM-RNN (Ko and Jung, 2022). These models amalgamate the strengths of distinct neural networks: LSTM-CNN extracts intricate features via CNNs while LSTM captures temporal dependencies, rendering it suitable for managing complex time series data; LSTM-RNN integrates RNN’s capability to handle short-term dependencies with LSTM’s capacity to capture long-term trends, making it ideal for data exhibiting both short-term fluctuations and long-term patterns; CNN-LSTM-RNN consolidates the advantages of CNNs, LSTMs, and RNNs, enabling it to process more intricate air pollution data scenarios.
Despite these hybrid models demonstrating substantial performance improvements, they are accompanied by several limitations, such as elevated model complexity, extended training times, substantial hardware resource demands, and difficulties in hyperparameter tuning, which escalate optimization costs. Furthermore, the intricacy of these models often results in overfitting, particularly when data is limited or of inferior quality (Wang et al., 2022; Yuan et al., 2020).

To address these challenges, Transformer-based models have demonstrated exceptional performance in tackling the intricacies of feature modeling, primarily due to their attention mechanism (Zhang and Zhang, 2023). However, conventional Transformer models typically exhibit high computational complexity and are susceptible to noise when managing high-dimensional dependencies (Guo and Mao, 2023). To alleviate these issues, researchers have introduced sparse attention mechanisms that concentrate on crucial dependencies, substantially reducing computational complexity to linear levels while maintaining robust global modeling capabilities (Al-qaness et al., 2023; Ma Z. et al., 2023). Considering that air pollutant concentrations frequently display significant seasonal variations influenced by meteorological factors, integrating time series decomposition and autocorrelation mechanisms can aid the model in better grasping the complex interdependencies among various features in the time series. Furthermore, frequency-domain enhancement techniques have substantially improved the overall performance and efficiency of the models by diminishing noise interference in long-term dependencies (Zeng et al., 2023). Inspired by these advancements, we propose the SFDformer method. This approach employs time series decomposition techniques to segregate the data into seasonal and trend components, effectively capturing factors such as air pollution, which are subject to seasonal fluctuations and trend variations. By employing Fourier transforms to transform time-domain data into frequency-domain data, we mitigate noise interference. The sparse attention mechanism further prioritizes essential frequency components and assigns them higher weights, enabling the model to capture critical short-term alterations while preserving vital long-term traits. 
This enhancement not only significantly boosts computational efficiency but also bolsters the model’s stability and robustness in capturing the dependencies between high-dimensional features of air pollution concentrations, offering a more efficient and precise solution for intricate air pollution forecasting tasks.

3 Methodology

3.1 Background

The air pollution forecasting problem can be defined in a rolling prediction setting, where the future air quality over a given time horizon is predicted based on historical observations within a fixed-size window. At each time point $t$, the input sequence $\mathcal{X}^t = \{x_1^t, \ldots, x_{L_x}^t\}$ consists of observed values across multiple feature dimensions. The output sequence $\mathcal{Y}^t = \{y_1^t, \ldots, y_{L_y}^t\}$ predicts air quality indicators, such as concentrations of PM2.5, PM10, NO2, etc., over several future time points. This setup enables the model to predict multiple pollutants simultaneously, making it highly suitable for air quality monitoring and management applications. By providing such predictions, relevant authorities can take proactive measures to mitigate the impact of air pollution, thereby enhancing the quality of life for urban residents.
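As a concrete illustration of this rolling setting, the windowing can be sketched in a few lines of Python. The helper name `make_windows` and the toy sizes are our own, not part of the paper:

```python
import numpy as np

def make_windows(series, L_x, L_y):
    """Slice a multivariate series of shape (T, D) into rolling pairs:
    each input window X^t holds L_x past steps and each target Y^t the
    following L_y steps, matching the rolling forecasting setting."""
    X, Y = [], []
    for t in range(len(series) - L_x - L_y + 1):
        X.append(series[t : t + L_x])
        Y.append(series[t + L_x : t + L_x + L_y])
    return np.stack(X), np.stack(Y)

# toy example: 100 daily steps, 3 pollutant/meteorological features
data = np.random.rand(100, 3)
X, Y = make_windows(data, L_x=96, L_y=4)
print(X.shape, Y.shape)  # (1, 96, 3) (1, 4, 3)
```

Each row of `X` is one historical window and the matching row of `Y` is the horizon to be predicted, so a single model call can emit all pollutant channels at once.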

3.2 Time series pooling decomposition module

In real-world air pollution time series data, intricate seasonal patterns often intertwine with trend components, making them difficult to disentangle. Traditional fixed-window average pooling methods struggle to effectively capture such diverse temporal characteristics. To address this challenge, as depicted in Figure 2, we introduce a Time Series Pooling Decomposition Module (TSP Decomp), meticulously designed to tackle the complexities inherent in environmental time series forecasting.


Figure 2. Time series pooling decomposition module.

This module incorporates a variety of average pooling filters with differing window sizes, allowing for the adaptable extraction of multiple trend components from the input signal. Furthermore, a dynamic weighting mechanism, based on the attributes of the input data, combines these trend components into a comprehensive final trend depiction. As shown in Equations 1, 2:

$X_{\mathrm{trend}} = \mathrm{Softmax}_T(x) \cdot P(x)$  (1)
$X_{\mathrm{season}} = X - X_{\mathrm{trend}}$  (2)

In these two formulas, $P(\cdot)$ denotes a set of average pooling filters, crafted to capture trends across diverse temporal scales. Furthermore, $\mathrm{Softmax}_T(x)$ acts as a data-dependent weight allocation function, effectively combining these identified trends into a cohesive final trend representation.
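The decomposition in Equations 1, 2 can be sketched in plain NumPy. This is a minimal illustration rather than the authors' implementation: the pooling window sizes and the `logits` argument, which stands in for the learned data-dependent $\mathrm{Softmax}_T$ scores, are assumptions made for the example.

```python
import numpy as np

def moving_avg(x, w):
    """Average pooling with window w and edge padding (stride 1),
    so the smoothed trend keeps the original length."""
    pad = np.concatenate([np.repeat(x[:1], (w - 1) // 2), x,
                          np.repeat(x[-1:], w // 2)])
    kernel = np.ones(w) / w
    return np.convolve(pad, kernel, mode="valid")

def tsp_decomp(x, windows=(5, 13, 25), logits=None):
    """Sketch of TSP Decomp: several average-pooling trends P(x) are mixed
    by softmax weights (Eq. 1); the seasonal part is the residual (Eq. 2).
    `logits` is a placeholder for the learned weight scores."""
    trends = np.stack([moving_avg(x, w) for w in windows])   # (K, L)
    if logits is None:
        logits = np.zeros(len(windows))                      # uniform fallback
    weights = np.exp(logits) / np.exp(logits).sum()          # Softmax_T
    x_trend = (weights[:, None] * trends).sum(axis=0)        # Eq. 1
    x_season = x - x_trend                                   # Eq. 2
    return x_trend, x_season

x = np.arange(50.0)
trend, season = tsp_decomp(x)
```

By construction, trend and seasonal parts always sum back to the input series, which is what lets the two branches be modeled separately downstream.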

3.3 The mutual conversion between the time domain and the frequency domain

In the scholarly domain of air pollution time series forecasting, the Discrete Fourier Transform (DFT) and its counterpart, the Inverse Discrete Fourier Transform (IDFT), are instrumental in scrutinizing complex periodicity and trend variation patterns. This is accomplished by enabling the transition of time series data between the temporal and frequency domains. The DFT decomposes the time series into long-term trends and periodic components, which facilitates the identification of significant periodic features and the elimination of random noise. Subsequently, the IDFT reconstructs the processed signal back into the time domain.

For a time series $x[n] \in \mathbb{R}^N$ of length $N$, the DFT is given by Equation 3 as follows:

$X_k = \sum_{n=0}^{N-1} x[n]\, e^{-i\frac{2\pi}{N}kn}, \quad k = 0, 1, \ldots, N-1$  (3)

The IDFT uses Equation 4 to restore the frequency-domain data to the time domain:

$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{i\frac{2\pi}{N}kn}, \quad n = 0, 1, \ldots, N-1$  (4)

In the DFT, $N$ determines the series length and frequency resolution, $k$ indexes the frequency components, and $e^{-i\frac{2\pi}{N}kn}$ extracts sinusoidal elements, enabling frequency-domain decomposition. In the IDFT, these parameters reconstruct the time-domain signal, with $X_k$ providing amplitude and phase, and $e^{i\frac{2\pi}{N}kn}$ synthesizing the signal. Together, the DFT and IDFT support feature extraction, periodic pattern recognition, trend analysis, and noise reduction. The coefficients $X_k$ represent the frequency-domain content, where low frequencies indicate trends and high frequencies reflect noise or rapid fluctuations.
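Equations 3, 4 match the conventions of NumPy's `fft`/`ifft`, which makes the denoising idea easy to demonstrate: transform, keep a low-frequency band, and transform back. The band width `keep` is an arbitrary choice for this sketch, not a value from the paper.

```python
import numpy as np

# a periodic component buried in noise
rng = np.random.default_rng(0)
n = np.arange(256)
clean = np.sin(2 * np.pi * n / 32)
noisy = clean + 0.3 * rng.standard_normal(256)

X = np.fft.fft(noisy)                    # DFT (Eq. 3)
keep = 16                                # retain only a low-frequency band
X_filtered = np.zeros_like(X)
X_filtered[:keep] = X[:keep]
X_filtered[-keep + 1:] = X[-keep + 1:]   # mirrored band (real-valued signal)
denoised = np.fft.ifft(X_filtered).real  # IDFT (Eq. 4)

# the unfiltered round trip is exact up to floating-point error
assert np.allclose(np.fft.ifft(X).real, noisy)
```

Because the periodic component lives at a low frequency, discarding the high-frequency coefficients removes most of the noise while preserving the trend and seasonality.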

3.4 Frequency-sparse attention mechanism with fourier transform

3.4.1 Traditional attention mechanisms with quadratic complexity

The conventional attention mechanisms utilize three inputs: Q (the query), K (the key), and V (the value) matrices. These mechanisms compute scaled dot-product attention. This is determined by Equation 5:

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$  (5)

In the formulas, the matrices $Q \in \mathbb{R}^{L_q \times d}$, $K \in \mathbb{R}^{L_k \times d}$, and $V \in \mathbb{R}^{L_v \times d}$ are defined, with $d$ representing the dimensionality of the input data. When examining the traditional attention mechanisms, particular attention is paid to the distribution of attention for the $i$-th query, referred to as $q_i$. This distribution is calculated using an asymmetric kernel smoother, which yields the attention associated with the $i$-th query, as shown in Equation 6:

$\mathrm{Attention}(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{j} k(q_i, k_j)}\, v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j]$  (6)

The probability $p(k_j \mid q_i)$ is calculated as $k(q_i, k_j) / \sum_j k(q_i, k_j)$, where $k(q_i, k_j)$ represents the asymmetric exponential kernel $\exp\!\left(q_i k_j^T / \sqrt{d}\right)$. This computation entails quadratic dot-product operations, resulting in a computational complexity of $O(L_q L_k)$. This complexity poses a considerable challenge in terms of memory usage for models designed to improve predictive performance.
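For reference, Equation 5 can be written out directly; the code makes the quadratic cost visible, since the full $L_q \times L_k$ score matrix is materialized. This is a generic sketch of standard scaled dot-product attention, not the paper's model code.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Vanilla scaled dot-product attention (Eq. 5). The (L_q, L_k) score
    matrix is formed explicitly, which is the O(L^2) cost the paper avoids."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (L_q, L_k): quadratic in length
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((10, 4))
V = rng.standard_normal((10, 4))
out = full_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so every entry stays within the range of the corresponding value column.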

3.4.2 Query sparsity measurement

In the traditional attention mechanisms, the attention distribution $p(k_j \mid q_i)$ for the $i$-th query is represented as a weighted aggregation over all keys. High dot products between queries and keys lead to uneven attention distributions, potentially reducing the significance of individual values. To tackle this issue, a mechanism grounded in KL divergence is proposed to assess the resemblance between the attention distribution and a predefined baseline. The degree of similarity is determined by using Equation 7:

$KL(Q \parallel p) = \ln\!\left(\frac{1}{L_n}\sum_{j=1}^{L_n} e^{\frac{Q_i K_j^T}{\sqrt{d}}}\right) + \ln L_K - \frac{1}{L_n}\sum_{j=1}^{L_n} \frac{Q_i K_j^T}{\sqrt{d}}$  (7)

In the above formula, $L_n$ represents the number of keys, $Q_i K_j^T$ indicates the dot product between the query and the key, and $d$ is the dimensionality of the features. The distillation measure, denoted as $M(Q_i, K)$, is defined by Equation 8 as follows:

$M(Q_i, K) = \ln \sum_{j=1}^{L_n} e^{\frac{Q_i K_j^T}{\sqrt{d}}} - \frac{1}{L_n}\sum_{j=1}^{L_n} \frac{Q_i K_j^T}{\sqrt{d}}$  (8)

A higher $M(Q_i, K)$ value indicates a more diverse attention distribution for the $i$-th query, potentially focusing on dominant query-key pairs in the tail of the self-attention output. This approach allows the model to prioritize influential query-key pairs, thereby enhancing the overall effectiveness of the knowledge extraction process.
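A small numerical sketch of the measure in Equation 8, written here as log-sum-exp of the scaled scores minus their mean; the toy queries and keys are our own, chosen so that a peaked query visibly scores higher than a flat one:

```python
import numpy as np

def sparsity_measure(Q, K):
    """Per-query sparsity score M(q_i, K) (Eq. 8): log-sum-exp of the
    scaled attention scores minus their mean. Queries with high M have
    the most uneven (informative) attention distributions."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (L_q, L_k)
    lse = np.log(np.exp(scores).sum(axis=1))   # ln sum e^{...}
    return lse - scores.mean(axis=1)           # M, one value per query

K = np.eye(4)
Q = np.array([[4.0, 0.0, 0.0, 0.0],   # peaked: attends mostly to key 0
              [1.0, 1.0, 1.0, 1.0]])  # flat: uniform over keys
M = sparsity_measure(Q, K)
assert M[0] > M[1]   # the peaked query is ranked as more informative
```

Ranking queries by this score and keeping only the top ones is what lets the sparse mechanism drop near-uniform (uninformative) rows of the attention matrix.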

3.4.3 Frequency-sparse attention mechanism

We apply the Discrete Fourier Transform (DFT) to transform the queries q, keys k, and values v. Subsequently, we execute a comparable attention mechanism in the frequency domain by choosing the Top-u weight matrix patterns. The versions of the queries, keys, and values after the DFT transformation are represented as $\tilde{Q} \in \mathbb{C}^{M \times D}$, $\tilde{K} \in \mathbb{C}^{M \times D}$, and $\tilde{V} \in \mathbb{C}^{M \times D}$. The Frequency-Sparse Attention Mechanism incorporating the Fourier Transform (SFD) is outlined as follows in Equations 9–12:

$\tilde{Q} = \mathrm{Top}_u(\mathcal{F}(q))$  (9)
$\tilde{K} = \mathrm{Top}_u(\mathcal{F}(k))$  (10)
$\tilde{V} = \mathrm{Top}_u(\mathcal{F}(v))$  (11)
$\mathrm{SFDAttention}(q, k, v) = \mathcal{F}^{-1}\left(\mathrm{Padding}\left(\sigma(\tilde{Q} \cdot \tilde{K}^T) \cdot \tilde{V}\right)\right)$  (12)

In the above formulas, σ represents an activation function; we utilize either softmax or tanh, since their convergence performance differs among datasets. Let $Y = \sigma(\tilde{Q}\tilde{K}^T) \cdot \tilde{V}$, where $Y \in \mathbb{C}^{M \times D}$. The structure of the Frequency-Sparse Attention Mechanism with Fourier Transform (SFD) is depicted in Figure 1.

At the frequency-screening stage, it is sufficient to randomly sample $u = L_K \ln L_Q$ dot-product pairs for the computation of $M(Q_i, K)$, with the remaining pairs filled with zeros. From these sampled pairs, the sparse Top-u set is selected to form the reduced query matrix. The max operator in $M(Q_i, K)$ exhibits reduced sensitivity to zero values, thereby ensuring numerical stability. In practical applications, the input lengths of queries and keys are typically equivalent in self-attention computations, i.e., $L_Q = L_K = L$. Consequently, the overall time and space complexity of the SFDAttention mechanism is $O(L \log L)$.
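Putting Equations 9–12 together, a loose NumPy sketch of the frequency-sparse attention follows. The Top-u selection by coefficient magnitude, the tanh activation, and the choice $u \approx \ln L$ are simplifying assumptions for illustration; the paper's exact padding and sampling scheme may differ.

```python
import numpy as np

def top_u_mask(Z, u):
    """Keep only the u largest-magnitude frequency coefficients per column,
    zeroing the rest (the implicit 'Padding' of Eq. 12)."""
    out = np.zeros_like(Z)
    for j in range(Z.shape[1]):
        idx = np.argsort(np.abs(Z[:, j]))[-u:]
        out[idx, j] = Z[idx, j]
    return out

def sfd_attention(q, k, v, u=None):
    """Sketch of Eqs. 9-12: FFT the inputs, retain the top-u frequency
    modes, attend in the frequency domain (tanh activation here), and
    inverse-FFT the result back to the time domain."""
    L = q.shape[0]
    u = u or max(1, int(np.log(L)))            # u ~ ln L keeps the cost low
    Qf = top_u_mask(np.fft.fft(q, axis=0), u)  # Eq. 9
    Kf = top_u_mask(np.fft.fft(k, axis=0), u)  # Eq. 10
    Vf = top_u_mask(np.fft.fft(v, axis=0), u)  # Eq. 11
    Y = np.tanh((Qf @ Kf.conj().T).real) @ Vf  # frequency-domain attention
    return np.fft.ifft(Y, axis=0).real         # Eq. 12

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 4))
k = rng.standard_normal((32, 4))
v = rng.standard_normal((32, 4))
out = sfd_attention(q, k, v)
```

Because only about $\ln L$ frequency modes survive the mask, the score products involve sparse spectra, which is the source of the claimed efficiency gain over the dense time-domain attention of Equation 5.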

4 Experiment

4.1 Data description

This research employed historical data on pollutant concentrations and meteorological conditions, gathered from monitoring stations situated in eight different cities throughout China. The dataset spans the timeframe from 28 October 2013 to 31 May 2021. The experimental data is organized from a city-level perspective: daily sample data for each city is represented as a one-dimensional feature vector whose elements are pollutant concentrations and meteorological factors. The eight selected cities are Baoding, Handan, Shijiazhuang, Xingtai, Yulin, Lishui, Urumqi, and Jingzhou, each exhibiting unique economic development characteristics within China. These cities are strategically positioned across diverse geographical regions of the country, each presenting distinct pollution characteristics (see Figure 3). In the analysis, six distinct types of pollutants were considered, alongside three indicators for evaluating pollution levels and three meteorological factors that influence pollutant concentrations (refer to Table 1 for details): air quality grade, AQI index, daily AQI ranking, O3, PM10, SO2, NO2, CO, PM2.5, temperature, wind speed, and precipitation. In Figure 4, we present the daily PM2.5 concentration for each city segment in the dataset from 28 October 2020 to 31 May 2021. In accordance with established procedures, the compiled datasets were methodically divided into training, validation, and test subsets, arranged sequentially over time, following an allocation ratio of 7:1:2 (Hua et al., 2019).
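The chronological 7:1:2 split can be sketched as follows; the day count of 2,773 is our approximation of the 28 October 2013 to 31 May 2021 span, used only to make the example concrete:

```python
import numpy as np

def sequential_split(data, ratios=(0.7, 0.1, 0.2)):
    """Chronological train/validation/test split (no shuffling, so the
    test period strictly follows the training period)."""
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

days = np.arange(2773)   # ~number of days from 2013-10-28 to 2021-05-31
train, val, test = sequential_split(days)
print(len(train), len(val), len(test))  # 1941 277 555
```

Keeping the split strictly ordered in time prevents leakage of future observations into training, which matters for a forecasting benchmark.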


Figure 3. Geographical locations of cities.


Table 1. Characteristic indicators of air pollution time prediction dataset. We utilize Indicator to represent various features within the air pollution dataset. The Indicator Definition elucidates the meaning of each feature, while the Corresponding Characteristics describe the specific attributes associated with these features.


Figure 4. PM2.5 concentrations in various cities.

4.2 Implementation details

Harnessing the benefits of Transformer architectures in managing time series information, we integrated residual connections into our model, embedding them within decomposition blocks (Yu et al., 2024). These blocks incorporate functionalities like moving averages, which assist in evening out periodic oscillations and highlighting long-term tendencies within the time series data. As a result, residual connections significantly improve the model’s ability to perceive and assimilate complex patterns inherent in time series, thereby markedly boosting its proficiency in long-term projections. To further enhance the self-attention mechanism, we subjected the input features to nonlinear transformations and dimensional alterations via a Multi-Layer Perceptron (MLP), resulting in innovative feature renditions. This tactic allows the model to more precisely detect intricate patterns and profound interconnections embedded in the time series information, ultimately refining its overall predictive capabilities. Our training methodology utilizes L2 loss along with the ADAM optimizer (Kingma and Ba, 2015), initiated with a learning rate of 0.0001 and a batch size of 32. The attention factor is established at 3, and weight decay is set to 0.1. Training stops early within 10 epochs. Every experiment was replicated thrice and executed using PyTorch (Paszke et al., 2019), facilitated on a solitary NVIDIA Tesla V100 32 GB GPU (Markidis et al., 2018).

In this study, we utilize Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) as three essential criteria to assess the predictive accuracy of the SFDformer model. The detailed explanations for calculating these indicators are provided in Equations 1315:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$  (13)
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$  (14)
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$  (15)

where $y_i$ represents the actual observed value, $\hat{y}_i$ is the predicted value from the model, and $n$ is the total number of data points. These metrics allow us to intuitively evaluate the accuracy of the model’s predictions. Lower values of MSE and MAE indicate that the predicted values are closer to the actual values, suggesting better predictive performance. Additionally, a lower RMSE implies a better model fit to the data, indicating more reliable prediction results.
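Equations 13–15 translate directly into NumPy; the toy values below are only to exercise the formulas:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)          # Eq. 13

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))         # Eq. 14

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))             # Eq. 15

y = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])
print(mse(y, y_hat), mae(y, y_hat), rmse(y, y_hat))
```

Note that RMSE is simply the square root of MSE, so the two always rank models identically; MAE can disagree with them when large errors are rare but severe.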

We evaluated seven baseline methods for comparative analysis. In the multivariate setting, we selected four Transformer-based models: Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), Reformer (Kitaev et al., 2020), and Pyraformer (Liu et al., 2022), in addition to one model based on linear networks: FiLM (Zhou et al., 2022a). For the univariate setting, we considered more competitive baselines: FEDformer (Zhou et al., 2022b), and a model based on MLP: LightTS (Campos et al., 2023).

4.3 Main results

4.3.1 Multivariate results

Multivariate analysis involves the simultaneous consideration of multiple time series to examine the interrelationships and influences among them. In the multivariate settings, we conducted experiments using eight different datasets. The results indicate that SFDformer consistently achieved state-of-the-art performance across most baseline and prediction horizon configurations (see Table 2). Specifically, under the input-96-predict-58 configuration (the model utilizes 96 historical data points to forecast 58 future data points), SFDformer reduces the MSE by 0.9% in Baoding (0.328 → 0.325), 1.6% in Handan (0.471 → 0.463), 1.8% in Shijiazhuang (0.375 → 0.368), 1.5% in Xingtai (0.443 → 0.436), 0.7% in Yulin (0.763 → 0.757), 2.1% in Lishui (0.601 → 0.588), 7.6% in Urumqi (0.364 → 0.336), and 6.4% in Jingzhou (0.823 → 0.770) compared to previous state-of-the-art results. Overall, in this configuration, the average MSE reduction for SFDformer is 22.46%. Furthermore, on the Shijiazhuang dataset, SFDformer did not exhibit optimal performance in the input-96-predict-12 and input-96-predict-36 settings. However, its performance improves as the prediction horizon extends. This improvement can be attributed to the relatively minor impact of noise in short-term forecasting, whereas long-term forecasting is more influenced by the intricate temporal patterns inherent in real-world time series, demonstrating SFDformer’s ability to better handle complex temporal patterns.


Table 2. Multivariate results with different prediction lengths $O \in \{12, 36, 58, 96\}$ for eight different datasets when $I = 96$. MSE Reduction refers to the percentage decrease in MSE of SFDformer compared to other models. The best average results are in bold, while the second-best results are underlined.

4.3.2 Univariate results

Univariate analysis predicts future values based solely on the historical data of a single time series. We report the univariate results for eight representative datasets in Table 3. Compared with numerous baseline models, SFDformer achieves state-of-the-art performance in prediction tasks. Notably, under the input-96-predict-58 setup, our model reduces the mean absolute error (MAE) by 1.1% on Baoding (0.339 → 0.335), 0.4% on Handan (0.404 → 0.402), 1.8% on Shijiazhuang (0.320 → 0.314), 1.1% on Xingtai (0.344 → 0.340), 0.7% on Yulin (0.551 → 0.547), 0.8% on Lishui (0.336 → 0.333), 8% on Urumqi (0.357 → 0.326), and 7% on Jingzhou (0.496 → 0.458). Moreover, the model's accuracy remains consistent as the prediction horizon extends, underscoring its robustness in forecasting PM2.5 air pollution concentration levels.

Table 3. Univariate results with different prediction lengths O ∈ {12, 36, 58, 96} for eight different datasets when I = 96. MAE Reduction refers to the percentage decrease in MAE of SFDformer compared to other models. The best average results are in bold, while the second-best results are underlined.

4.3.3 Ablation research

This study assesses the impact of the Sparse Frequency Domain Attention (SFDA) module on model performance via an ablation experiment. Three variants of SFDformer were tested: KEDformer, which replaces both the self-attention and cross-attention mechanisms with SFDA; SFDformerV1, which replaces only the self-attention mechanism with SFDA while retaining the cross-correlation attention mechanism; and SFDformerV2, which uses self-correlation attention for both mechanisms. The experiments were conducted on eight datasets, as shown in Table 4. SFDformer exhibited performance improvements in 90 out of 96 test cases. Importantly, the SFDformer variant equipped with the SFDA module showed consistent improvements across all cases, corroborating the effectiveness of SFDA as a replacement for traditional attention mechanisms and its contribution to the model's performance.

Table 4. Ablation study results with different prediction lengths O ∈ {12, 36, 58, 96} for eight different datasets when I = 96. MSE Reduction refers to the percentage decrease in MSE of SFDformer compared to other models. The best average results are shown in bold, and the second-best are underlined.

5 Discussion

5.1 Efficiency analysis and performance analysis

The present study comprehensively evaluates the impact of various self-attention mechanisms on model performance and computational efficiency, with a detailed analysis of the trade-offs between these two aspects (see Figure 5). To further verify the model’s generalization capability across regions with different levels of air pollution, two distinct locations were selected: Handan, situated in northern China and characterized by relatively severe air pollution, and Lishui, located in eastern China with relatively mild air pollution. The SFDformer model stands out from other models by integrating Fourier transform and sparse attention techniques into its attention mechanism, thereby significantly enhancing prediction accuracy. Compared with traditional Transformer models, SFDformer effectively mitigates the inherent quadratic complexity of conventional attention mechanisms, leading to a substantial improvement in operational efficiency. This feature makes SFDformer particularly well-suited for handling large-scale time series datasets, such as those used in air pollution forecasting tasks.
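This excerpt does not spell out the SFDA equations, but the core idea described above — selecting a sparse set of dominant frequency modes after a Fourier transform, rather than attending over all L × L time-step pairs — can be sketched in NumPy. The function name and the top-k-by-magnitude selection rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sparse_frequency_attention(x: np.ndarray, k: int) -> np.ndarray:
    """Sketch: move a sequence (length L, d channels) into the frequency
    domain with an FFT, keep only the k largest-magnitude frequency modes
    (the 'sparse' selection), zero out the rest, and transform back.
    Cost is dominated by the FFT: O(L log L) instead of O(L^2)."""
    L = x.shape[0]
    xf = np.fft.rfft(x, axis=0)                  # (L//2 + 1, d) complex spectrum
    energy = np.abs(xf).mean(axis=1)             # magnitude per frequency mode
    keep = np.argsort(energy)[-k:]               # indices of the top-k modes
    mask = np.zeros_like(energy, dtype=bool)
    mask[keep] = True
    xf_sparse = np.where(mask[:, None], xf, 0)   # discard low-energy (noisy) modes
    return np.fft.irfft(xf_sparse, n=L, axis=0)  # back to the time domain

# Toy usage: a noisy daily cycle; keeping 2 modes recovers the dominant period.
t = np.arange(96)
x = (np.sin(2 * np.pi * t / 24)[:, None]
     + 0.1 * np.random.default_rng(0).normal(size=(96, 1)))
x_denoised = sparse_frequency_attention(x, k=2)
```

This mirrors the noise-suppression role the text attributes to the frequency-domain path: low-energy modes, which mostly carry noise, never reach the attention computation.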

Figure 5. Experiment for evaluating computational efficiency and performance. The input length was fixed at I = 96, with prediction lengths set to O ∈ {12, 36, 58, 96}. Computational efficiency was measured by the time (in seconds) required for each model to complete one hundred epochs. Performance was assessed using Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the key metrics.

5.2 Computation efficiency

In the multivariate setting, and with the current optimal implementation of all methods, SFDformer achieves a significant improvement in computational efficiency over conventional Transformer models. This improvement addresses the quadratic time complexity O(L²) and memory usage O(L²) inherent to standard self-attention mechanisms. By employing sparse attention and the discrete Fourier transform, SFDformer reduces both time complexity and memory usage to O(L log L), enhancing the model's capability to handle real-world air pollutant concentration prediction. During the testing phase, SFDformer completes predictions in a single forward step, in contrast to traditional models that require O(L) steps, further increasing its efficiency. As demonstrated in Table 5, SFDformer strikes a superior balance between computational efficiency and predictive accuracy, rendering it a practical solution for air pollutant concentration prediction tasks in resource-constrained environments.
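The practical gap between the two complexity classes grows quickly with sequence length. A back-of-envelope sketch (operation counts only; constants, memory traffic, and hardware effects ignored):

```python
import math

# Rough comparison of the O(L^2) cost of full self-attention with the
# O(L log L) cost of an FFT-based path, for growing input lengths L.
def attention_ops(L: int) -> float:
    return float(L * L)

def fft_ops(L: int) -> float:
    return L * math.log2(L)

for L in (96, 960, 9600):
    print(f"L={L:>5}: L^2={attention_ops(L):>12,.0f}  "
          f"L*log2(L)={fft_ops(L):>10,.0f}  "
          f"ratio ~{attention_ops(L) / fft_ops(L):.0f}x")
```

At the paper's input length of 96 the gap is modest, but it widens by orders of magnitude for longer monitoring records, which is where the efficiency claim matters most.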

Table 5. Comparison of accuracy and efficiency metrics for different methods.

5.3 Performance impact of time series decomposition and frequency transformation

To explore the effectiveness of time series decomposition and Fourier transform techniques, we conducted experimental studies using datasets from Handan and Lishui, two regions with significantly different levels of air pollution. As illustrated in Figure 6, the SFDformer model integrates both techniques, whereas the SFDformer-NF model excludes the Fourier transform step, and the SFDformer-NFD model omits both techniques. The experimental results show that SFDformer surpasses the other two models, with performance gains stemming from several factors. Primarily, the time series decomposition technique enables the model to directly model the seasonal variations in air pollutant concentrations, thereby more accurately capturing periodic patterns and significantly improving the model's ability to make predictions based on historical data. Secondly, the application of the Fourier transform allows the model to discern and accentuate crucial features in the data while mitigating noise interference, ensuring the model concentrates on the most pertinent information during prediction. These findings substantiate the efficacy of time series decomposition and Fourier transform techniques in improving model performance.
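The decomposition step can be sketched as follows. This is a generic moving-average decomposition in the style popularized by Autoformer (Wu et al., 2021); the kernel size of 25 is a common default in that line of work, not a value taken from this paper:

```python
import numpy as np

def series_decomp(x: np.ndarray, kernel: int = 25):
    """Moving-average decomposition: the smoothed series is the trend
    component, the remainder is the seasonal component, and the two
    always sum back to the original series."""
    pad = kernel // 2
    xp = np.pad(x, (pad, pad), mode="edge")   # replicate ends to preserve length
    trend = np.convolve(xp, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

# Toy usage: a daily cycle riding on a slow upward trend.
x = np.sin(2 * np.pi * np.arange(200) / 24) + 0.01 * np.arange(200)
seasonal, trend = series_decomp(x)
```

Separating the two components lets the model fit the slow trend and the periodic part with mechanisms suited to each, which is the intuition behind the SFDformer-NF and SFDformer-NFD ablations above.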

Figure 6. PM2.5 concentration prediction results for the Handan and Lishui datasets. The prediction results of the SFDformer model are indicated by a red line, the SFDformer-NF model by a blue line, and the SFDformer-NFD model by a yellow line.

5.4 Generalization and predictive insights of the model on pollutant levels

In this study, the SFDformer model demonstrated remarkable precision in predicting PM2.5 concentrations. Industrial development is one of the sources of various air pollutants and is also a key factor contributing to air pollution. To further evaluate the generalization ability of this model, we selected two regions with different industrial characteristics for experiments: Handan, a city with significant heavy industrial development, and Lishui, a region dominated by light industrial activities. We applied the SFDformer model to predict the concentrations of additional pollutants, including PM10, carbon monoxide, sulfur dioxide, and ozone. As shown in Figure 7, the SFDformer model exhibited remarkable proficiency across these diverse pollutant prediction tasks. The experimental results clearly indicate that the SFDformer model outperforms alternative models in terms of generalization capability.

Figure 7. Prediction results for SO2, O3, PM10, and CO concentrations. The red areas represent the SFDformer model, the blue areas the Autoformer model, the green areas the LightTS model, and the purple areas the FiLM model.

6 Conclusion

The rapid advancement of deep learning technologies has led to their widespread adoption across academia and industry. This paper presents a novel framework, SFDformer, which integrates time series decomposition, the Fourier transform, and sparse attention mechanisms. Through time series decomposition, SFDformer captures the seasonal fluctuations and long-term trends of PM2.5 concentrations, elucidating the interplay between short-term variations and long-term patterns. The fusion of the Fourier transform and sparse attention not only reduces computational complexity from quadratic O(L²) to O(L log L), significantly enhancing efficiency, but also mitigates noise interference from air pollution features during prediction. This dual-mechanism design minimizes the impact of noise on prediction outcomes, enabling the model to better adapt to real-world temporal dynamics, which is pivotal for the accurate forecasting of PM2.5 concentration, a critical air pollution indicator.

In future research, we will focus on enhancing the adaptability of SFDformer to diverse datasets, especially those with irregular patterns. We are confident that through further optimization and expansion, SFDformer will achieve even more remarkable results in the highly challenging field of air pollution time-series forecasting. In summary, SFDformer has made significant breakthroughs in addressing the complexities of air pollution time-series forecasting. This achievement not only demonstrates its strong effectiveness but also highlights its great potential and broad application prospects in this critical field.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

Author contributions

ZQ: Conceptualization, Methodology, Formal Analysis, Investigation, Writing–original draft. BW: Methodology, Software, Investigation, Data curation, Writing–original draft, Visualization. CG: Investigation, Methodology, Validation, Writing–original draft. XC: Conceptualization, Investigation, Resources, Writing–review and editing, Supervision, Project administration. HZ: Conceptualization, Validation, Formal analysis, Resources, Writing–review and editing, Supervision, Project administration. CUIW: Validation, Resources, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ailshire, J. A., and Crimmins, E. M. (2014). Fine particulate matter air pollution and cognitive function among older US adults. Am. J. Epidemiol. 180, 359–366. doi:10.1093/aje/kwu155

Al-qaness, M. A., Dahou, A., Ewees, A. A., Abualigah, L., Huai, J., Abd Elaziz, M., et al. (2023). ResInformer: residual transformer-based artificial time-series forecasting model for PM2.5 concentration in three major Chinese cities. Mathematics 11, 476. doi:10.3390/math11020476

Athira, V., Geetha, P., Vinayakumar, R., and Soman, K. (2018). DeepAirNet: applying recurrent networks for air quality prediction. Procedia Comput. Sci. 132, 1394–1403. doi:10.1016/j.procs.2018.05.068

Campos, D., Zhang, M., Yang, B., Kieu, T., Guo, C., and Jensen, C. S. (2023). LightTS: lightweight time series classification with adaptive ensemble distillation. Proc. ACM Manag. Data 1, 1–27. doi:10.1145/3589316

Chen, S., He, L., Shen, S., Zhang, Y., and Ma, W. (2024). Improving air quality prediction via self-supervision masked air modeling. Atmosphere 15, 856. doi:10.3390/atmos15070856

Deng, Y., Zhi, P., Zhu, W., Sang, T., and Li, Y. (2024). Prediction of PM2.5 concentration based on Bayesian optimization random forest, in 2024 43rd Chinese Control Conference (CCC) (IEEE), 8507–8511.

Du, Y., Xu, X., Chu, M., Guo, Y., and Wang, J. (2016). Air particulate matter and cardiovascular disease: the epidemiological, biomedical and clinical evidence. J. Thorac. Dis. 8, E8–E19. doi:10.3978/j.issn.2072-1439.2015.11.37

Espinosa, R., Palma, J., Jiménez, F., Kamińska, J., Sciavicco, G., and Lucena-Sánchez, E. (2021). A time series forecasting based multi-criteria methodology for air quality prediction. Appl. Soft Comput. 113, 107850. doi:10.1016/j.asoc.2021.107850

Faraji, M., Nadi, S., Ghaffarpasand, O., Homayoni, S., and Downey, K. (2022). An integrated 3D CNN-GRU deep learning method for short-term prediction of PM2.5 concentration in urban environment. Sci. Total Environ. 834, 155324. doi:10.1016/j.scitotenv.2022.155324

Ghimire, S., Deo, R. C., Raj, N., and Mi, J. (2019). Deep solar radiation forecasting with convolutional neural network and long short-term memory network algorithms. Appl. Energy 253, 113541. doi:10.1016/j.apenergy.2019.113541

Gu, J., Yang, B., Brauer, M., and Zhang, K. M. (2021). Enhancing the evaluation and interpretability of data-driven air quality models. Atmos. Environ. 246, 118125. doi:10.1016/j.atmosenv.2020.118125

Guo, Y., and Mao, Z. (2023). Long-term prediction model for NOx emission based on LSTM–Transformer. Electronics 12, 3929. doi:10.3390/electronics12183929

Han, J., Lin, H., and Qin, Z. (2023). Prediction and comparison of in-vehicle CO2 concentration based on ARIMA and LSTM models. Appl. Sci. 13, 10858. doi:10.3390/app131910858

Haq, M. A., and Ahmad Khan, R. (2022). SMOTEDNN: a novel model for air pollution forecasting and AQI classification. Comput. Mater. & Continua 71, 1403–1425. doi:10.32604/cmc.2022.021968

He, J., Gong, S., Yu, Y., Yu, L., Wu, L., Mao, H., et al. (2017). Air pollution characteristics and their relation to meteorological conditions during 2014–2015 in major Chinese cities. Environ. Pollut. 223, 484–496. doi:10.1016/j.envpol.2017.01.050

Hofman, J., Do, T. H., Qin, X., Bonet, E. R., Philips, W., Deligiannis, N., et al. (2022). Spatiotemporal air quality inference of low-cost sensor data: evidence from multiple sensor testbeds. Environ. Model. & Softw. 149, 105306. doi:10.1016/j.envsoft.2022.105306

Hua, Y., Zhao, Z., Li, R., Chen, X., Liu, Z., and Zhang, H. (2019). Deep learning with long short-term memory for time series prediction. IEEE Commun. Mag. 57, 114–119. doi:10.1109/mcom.2019.1800155

Kang, G. K., Gao, J. Z., Chiao, S., Lu, S., and Xie, G. (2018). Air quality prediction: big data and machine learning approaches. Int. J. Environ. Sci. Dev. 9, 8–16. doi:10.18178/ijesd.2018.9.1.1066

Kim, M. K., Cremers, B., Liu, J., Zhang, J., and Wang, J. (2022). Prediction and correlation analysis of ventilation performance in a residential building using artificial neural network models based on data-driven analysis. Sustain. Cities Soc. 83, 103981. doi:10.1016/j.scs.2022.103981

Kingma, D. P., and Ba, J. (2015). Adam: a method for stochastic optimization, in 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings, 1.

Kitaev, N., Kaiser, Ł., and Levskaya, A. (2020). Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451.

Ko, K. K., and Jung, E. S. (2022). Improving air pollution prediction system through multimodal deep learning model optimization. Appl. Sci. 12, 10405. doi:10.3390/app122010405

Kshirsagar, A., and Shah, M. (2022). Anatomization of air quality prediction using neural networks, regression and hybrid models. J. Clean. Prod. 369, 133383. doi:10.1016/j.jclepro.2022.133383

Lanzi, E. (2016). The economic consequences of outdoor air pollution. OECD.

Liu, H., and Yang, R. (2021). A spatial multi-resolution multi-objective data-driven ensemble model for multi-step air quality index forecasting based on real-time decomposition. Comput. Industry 125, 103387. doi:10.1016/j.compind.2020.103387

Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., et al. (2022). Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. ICLR.

Lu, D., Mao, W., Xiao, W., and Zhang, L. (2021). Non-linear response of PM2.5 pollution to land use change in China. Remote Sens. 13, 1612. doi:10.3390/rs13091612

Luo, H., Han, Y., Cheng, X., Lu, C., and Wu, Y. (2020). Spatiotemporal variations in particulate matter and air quality over China: national, regional and urban scales. Atmosphere 12, 43. doi:10.3390/atmos12010043

Ma, X., Chen, T., Ge, R., Xv, F., Cui, C., and Li, J. (2023a). Prediction of PM2.5 concentration using spatiotemporal data with machine learning models. Atmosphere 14, 1517. doi:10.3390/atmos14101517

Ma, Z., Luo, W., Jiang, J., Wang, B., Ma, Z., Lin, J., et al. (2023b). Spatial and temporal characteristics analysis and prediction model of PM2.5 concentration based on spatiotemporal-informer model. PLoS One 18, e0287423. doi:10.1371/journal.pone.0287423

Markidis, S., Der Chien, S. W., Laure, E., Peng, I. B., and Vetter, J. S. (2018). NVIDIA tensor core programmability, performance & precision, in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (IEEE), 522–531.

Marvin, D., Nespoli, L., Strepparava, D., and Medici, V. (2022). A data-driven approach to forecasting ground-level ozone concentration. Int. J. Forecast. 38, 970–987. doi:10.1016/j.ijforecast.2021.07.008

Méndez, M., Merayo, M. G., and Núñez, M. (2023). Machine learning algorithms to forecast air quality: a survey. Artif. Intell. Rev. 56, 10031–10066. doi:10.1007/s10462-023-10424-4

Neiburger, M. (1969). The role of meteorology in the study and control of air pollution. Bull. Am. Meteorological Soc. 50, 957–966. doi:10.1175/1520-0477-50.12.957

Ozcanli, A. K., Yaprakdal, F., and Baysal, M. (2020). Deep learning methods and applications for electrical power systems: a comprehensive review. Int. J. Energy Res. 44, 7136–7157. doi:10.1002/er.5331

Pan, Q., Harrou, F., and Sun, Y. (2023). A comparison of machine learning methods for ozone pollution prediction. J. Big Data 10, 63. doi:10.1186/s40537-023-00748-x

Panneerselvam, V., and Thiagarajan, R. (2024). Toward accurate multi-region air quality prediction: integrating transformer-based deep learning and crossover boosted dynamic arithmetic optimization (CDAO). Signal Image Video Process 18, 4145–4156. doi:10.1007/s11760-024-03061-z

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: an imperative style, high-performance deep learning library. Adv. neural Inf. Process. Syst. 32. doi:10.48550/arxiv.1912.01703

Pöschl, U. (2005). Atmospheric aerosols: composition, transformation, climate and health effects. Angew. Chem. Int. Ed. 44, 7520–7540. doi:10.1002/anie.200501122

Rakholia, R., Le, Q., Vu, K., Ho, B. Q., and Carbajo, R. S. (2024). Accurate PM2.5 urban air pollution forecasting using multivariate ensemble learning accounting for evolving target distributions. Chemosphere 364, 143097. doi:10.1016/j.chemosphere.2024.143097

Rybarczyk, Y., and Zalakeviciute, R. (2018). Machine learning approaches for outdoor air quality modelling: a systematic review. Appl. Sci. 8, 2570. doi:10.3390/app8122570

Sarkar, N., Gupta, R., Keserwani, P. K., and Govil, M. C. (2022). Air quality index prediction using an effective hybrid deep learning model. Environ. Pollut. 315, 120404. doi:10.1016/j.envpol.2022.120404

Tagliabue, L. C., Cecconi, F. R., Rinaldi, S., and Ciribini, A. L. C. (2021). Data driven indoor air quality prediction in educational facilities based on IoT network. Energy Build. 236, 110782. doi:10.1016/j.enbuild.2021.110782

Thongthammachart, T., Araki, S., Shimadera, H., Eto, S., Matsuo, T., and Kondo, A. (2021). An integrated model combining random forests and WRF/CMAQ model for high accuracy spatiotemporal PM2.5 predictions in the Kansai region of Japan. Atmos. Environ. 262, 118620. doi:10.1016/j.atmosenv.2021.118620

Wang, X., Ahmad, I., Javeed, D., Zaidi, S. A., Alotaibi, F. M., Ghoneim, M. E., et al. (2022). Intelligent hybrid deep learning model for breast cancer detection. Electronics 11, 2767. doi:10.3390/electronics11172767

Wang, Y., Wang, H., and Zhang, S. (2020). Prediction of daily PM2.5 concentration in China using data-driven ordinary differential equations. Appl. Math. Comput. 375, 125088. doi:10.1016/j.amc.2020.125088

Wu, H., Xu, J., Wang, J., and Long, M. (2021). Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Adv. neural Inf. Process. Syst. 34, 22419–22430. doi:10.48550/arxiv.2106.13008

Yu, S., Peng, J., Ge, Y., Yu, X., Ding, F., Li, S., et al. (2024). A traffic state prediction method based on spatial–temporal data mining of floating car data by using autoformer architecture. Computer-Aided Civ. Infrastructure Eng. 39, 2774–2787. doi:10.1111/mice.13179

Yuan, Q., Shen, H., Li, T., Li, Z., Li, S., Jiang, Y., et al. (2020). Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241, 111716. doi:10.1016/j.rse.2020.111716

Zaini, N., Ean, L. W., Ahmed, A. N., and Malek, M. A. (2022). A systematic literature review of deep learning neural network for time series air quality forecasting. Environ. Sci. Pollut. Res. 29, 4958–4990. doi:10.1007/s11356-021-17442-1

Zeng, Q., Wang, L., Zhu, S., Gao, Y., Qiu, X., and Chen, L. (2023). Long-term PM2.5 concentrations forecasting using CEEMDAN and deep transformer neural network. Atmos. Pollut. Res. 14, 101839. doi:10.1016/j.apr.2023.101839

Zhang, Z., and Zhang, S. (2023). Modeling air quality PM2.5 forecasting using deep sparse attention-based transformer networks. Int. J. Environ. Sci. Technol. 20, 13535–13550. doi:10.1007/s13762-023-04900-1

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., et al. (2021). Informer: beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 35, 11106–11115. doi:10.1609/aaai.v35i12.17325

Zhou, T., Ma, Z., Wen, Q., Sun, L., Yao, T., Yin, W., et al. (2022a). FiLM: frequency improved Legendre memory model for long-term time series forecasting. Adv. neural Inf. Process. Syst. 35, 12677–12690. doi:10.48550/arxiv.2205.08897

Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. (2022b). FEDformer: frequency enhanced decomposed transformer for long-term series forecasting, in International Conference on Machine Learning (PMLR), 27268–27286.

Zhou, Y., De, S., Ewa, G., Perera, C., and Moessner, K. (2018). Data-driven air quality characterization for urban environments: a case study. IEEE Access 6, 77996–78006. doi:10.1109/access.2018.2884647

Keywords: frequency, time series forecasting, air pollution, transformer, sparse attention mechanism

Citation: Qin Z, Wei B, Gao C, Chen X, Zhang H and In Wong CU (2025) SFDformer: a frequency-based sparse decomposition transformer for air pollution time series prediction. Front. Environ. Sci. 13:1549209. doi: 10.3389/fenvs.2025.1549209

Received: 24 December 2024; Accepted: 13 February 2025;
Published: 17 March 2025.

Edited by:

Chen Siyu, Lanzhou University, China

Reviewed by:

Chunsong Lu, Nanjing University of Information Science and Technology, China
Yang Yang, Nanjing University of Information Science and Technology, China

Copyright © 2025 Qin, Wei, Gao, Chen, Zhang and In Wong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaolong Chen, chen_xiaolong123@126.com; Hongfeng Zhang, hfengzhang@mpu.edu.mo

ORCID: Zhenkai Qin, orcid.org/0009-0002-8862-2439

