Association, Correlation, and Causation Among Transport Variables of PM2.5

Zhao, Zhi-Dan; Zhao, Na; Ying, Na

doi:10.3389/fphy.2021.684104

ORIGINAL RESEARCH article

Front. Phys., 01 July 2021

Sec. Interdisciplinary Physics

Volume 9 - 2021 | https://doi.org/10.3389/fphy.2021.684104


Zhi-Dan Zhao¹,²*†

Na Zhao³,⁴†

Na Ying⁵*

¹Complexity Computation Lab, Department of Computer Science, School of Engineering, Shantou University, Shantou, China
²Key Laboratory of Intelligent Manufacturing Technology (Ministry of Education), Shantou University, Shantou, China
³Key Laboratory in Software Engineering of Yunnan Province, School of Software, Yunnan University, Kunming, China
⁴Electric Power Research Institute of Yunnan Power Grid, Kunming, China
⁵China State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, China

The issue of $P M_{2.5}$ pollution has received significant attention in the literature because of its social, economic, and political implications. Large data sets have been collected by pollution monitoring stations (i.e., nodes) throughout the world, which makes it possible to quantitatively characterize the dependence of $P M_{2.5}$ pollution among different regions. Here we divide the dependency relationship into three types: association, correlation, and causation. We examine these relationships using three approaches: random matrix theory (RMT), cross-correlation, and convergent cross-mapping (CCM). The aim of this study is to determine these three relationships between pollution data from different nodes. A random matrix analysis revealed that pollutant time series are not completely random, but are associated. Further analysis showed that $P M_{2.5}$ sequences had clear short-range correlations, whereas long-range correlations were blurred. Moreover, at the collective level, there were no clear causalities among pollutant concentrations from different geographical regions, regardless of distance and direction. These results indicate that the dependence of $P M_{2.5}$ pollution between different sites is complex. Nonetheless, this comprehensive analysis based on big data provides insights into critical issues of general interest, including pollution-induced climate change and pollution abatement.

1 Introduction

Air quality is a significant environmental concern around the world. Air pollution and aerosols have significant impacts on human health, climate, meteorological phenomena, and the environment, and many studies have focused on these effects [1–4]. $P M_{2.5}$, an airborne particulate, is the deadliest form of air pollution because it can penetrate deep into the lungs and bloodstream unfiltered [4]. It can cause permanent DNA mutations, heart attacks, and premature deaths, and leads to the deaths of three to seven million people every year [5, 6]. Moreover, the local nature of air pollution means that particles can significantly affect temperature, precipitation, and extreme events at a regional level; e.g., aerosols affect regional climates and ocean-atmosphere feedback [2, 7]. As Booth et al. have demonstrated, anthropogenic aerosol emissions influence historical climate events, such as peaks in hurricane activity and Sahel droughts [8]. Scientists have therefore developed new ways to understand the different factors that contribute to poor air quality, and this research has been used to develop observational systems using models and data and to assist decision makers with air quality assessments.

To investigate these influences, Rohde et al. developed a technique for mapping air pollution concentrations and sources using data from monitoring stations; after studying $P M_{2.5}$ pollution in China for about four months, they found a short-distance effect [9]. Dai et al. analyzed six pollutants in 350 Chinese cities and found both long-term correlations and a relationship between spatial correlations and provincial administrative divisions [10]. Additionally, teleconnections were found to indicate relationships between climate anomalies at significant distances (i.e., thousands of km) [11, 12]. Moreover, Yu et al. conducted mineral analyses to demonstrate that the long-range transport of soil particles contributed significantly to high concentrations of $P M_{2.5}$ during “dust days” [13]. Kaneyasu et al. focused on the impacts of long-range $P M_{2.5}$ transport in the Kyushu area and noticed that the $P M_{2.5}$ concentration is primarily dominated by the inflow of long-range transported aerosols [14]. Perrone et al. demonstrated that Mediterranean sites may be affected by long-range transported pollution, and that this pollution depends on the airflow [15]. Zhang et al. found the strongest correlation between winter and the $P M_{2.5}$ concentration in the North China Plain, which is mainly caused by the transport of $P M_{2.5}$ [16]. Recently, numerous researchers have used network methods to study climate and environmental issues, where nodes in the network represent geographic coordinate sites, and the cross-correlation and mutual information between the time series of two nodes are used to define connecting edges. These studies found that climate networks had very strong links that were caused by a proximity (i.e., distance) effect. Namely, pairs of sites close to each other (less than 2,000 km apart) were often strongly and positively correlated [17–26].

Technically speaking, association differs from correlation. Association means that one variable provides information about another, whereas correlation means that two variables increase or decrease together. Correlation implies association, but not causation. Conversely, causality implies association, but not necessarily correlation [27]. Despite the research described above, little attention has been given to $P M_{2.5}$ with regard to different spatiotemporal scales and cause-effect relationships among different sites. It is generally accepted that causality detection for climate change and human crises has an important role in future research on climate and environmental policies [28–30].

To explore how the association, correlation, and causality between $P M_{2.5}$ time series in different regions change with the direction and distance between monitoring sites, we use the RMT method to detect association, the cross-correlation method to measure correlation, and finally CCM to understand causality. The latter was recently developed as a non-linear dynamics based method for ascertaining and quantifying causal relationships between time series [30, 31]. The results of the study are as follows. First, the RMT analysis revealed that the time series of $P M_{2.5}$ at different sites are not completely random, and that there is a certain association. Second, a conventional correlation analysis indicated that there is a clear short-distance correlation between the time series from different sites, but no concise and clear correlation in the long-distance range. Finally, the results of the causality detection algorithm CCM demonstrated that, at the collective level, the causality between the $P M_{2.5}$ time series of different nodes is not obvious, and there is no distinct relationship with the distance and direction between the sites.

2 Data Collection

Empirical big data sets were analyzed for this study. They were obtained from global, Chinese, and American monitoring stations. The time resolution was hourly, which allowed the researchers to capture the time evolution of $P M_{2.5}$ . Moreover, in order to meet the calculation requirements of the correlation and causality algorithms, the data was cleaned: empty data segments were removed, it was ensured that the length of each original time series was more than 8,760 h (about 1 y: $24 \times 365$ ), and all the time series were adjusted to have common starting and ending points. As a result, the length, $L_{h}$ , of each time series (after the cleaning) was less than 8,760 h. The basic statistical properties of the filtered data sets can be seen in Table 1, where N is the number of nodes (i.e., cities, counties, or regions) and $L_{h}$ is the length of each time series at hourly resolution. Detailed descriptions of the different public data sets are given below.


TABLE 1. The basic properties of the three data sets and the parameters of the RMT analyses.

2.1 Global Stations

The global data was collected from [32]. It comprised names, longitudes and latitudes, recorded times (years, months, and hours), and $P M_{2.5}$ values. The data was obtained from December 2016 to December 2017 through a monitoring network that operated in 632 regions (or cities) across the world.

2.2 Chinese Stations

The Chinese data was collected from [33]. It comprised names, longitudes and latitudes, recorded times (years, months, and hours), and $P M_{2.5}$ time series. The data was obtained from January 2015 to June 2017 through a monitoring network that operated in 365 cities across China.

2.3 United States Stations

The American data was collected from [34]. It comprised names, longitudes and latitudes, and $P M_{2.5}$ time stamps. The $P M_{2.5}$ data was divided into two categories, firm and non-firm, and the non-firm data was used for the analyses. The data was obtained from January 2016 to December 2016 through a monitoring network that operated in 137 regions (or counties) across the United States (USA).

3 Methods

This section discusses the two correlation calculation methods, i.e., cross-correlation and RMT, which were used in this study to determine correlations between nodes. It also discusses the causality detection algorithm, CCM, which was used to measure causalities between the nodes. Finally, the polar coordinate system, the distance (d), and the azimuth (α) are discussed.

3.1 Cross Correlation

Previous studies used cross-correlation analyses to measure how correlations between nodes depend on distance [17–21]. For this study, a cross-correlation method [18–20] was used to calculate $W_{\max}$ at hourly resolution (length $L_{h}$ for each time series; see Table 1). Given the raw record ${\tilde{T}}^{d} (h)$ , where d is a day and h is an hour (from 0 to 23), each filtered record was defined as $T^{d} (h) = {\tilde{T}}^{d} (h) - (1 / L_{d}) \sum_{d} {\tilde{T}}^{d} (h)$ , where $L_{d}$ is the total number of days, as shown in Table 1. For each pair of nodes i and j, the cross-correlation of their time series was calculated as follows.

$$C_{ij}^{d} (τ) = \frac{\langle T_{i}^{d} (h) T_{j}^{d} (h - τ) \rangle - \langle T_{i}^{d} (h) \rangle \langle T_{j}^{d} (h - τ) \rangle}{σ_{T_{i}^{d} (h)} σ_{T_{j}^{d} (h - τ)}} \tag{1}$$

where $σ_{T_{i}^{d} (h)}$ is the standard deviation of $T_{i}^{d} (h)$ , τ is the time lag with a maximum value of 30 days, and $C_{i j}^{d} (τ) = C_{j i}^{d} (- τ)$ . The time lag at which $C_{i j}^{d} (τ)$ reaches its maximum was denoted $τ_{\max}$ . Then, positive link weights $(W_{\max})$ were defined as follows.

$$W_{\max}^{ij} = \frac{C_{ij}^{d} (τ_{\max}) - \mathrm{mean} [C_{ij}^{d} (τ)]}{\mathrm{std} [C_{ij}^{d} (τ)]}, \tag{2}$$

where “mean” and “std” denote the average and the standard deviation of $C_{i j}^{d} (τ)$ taken over the time lags τ.
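The filtering step and Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`remove_daily_cycle`, `lagged_corr`, `link_weight`) and the `max_lag` parameter are hypothetical, lags are counted in samples (hours), and the scan is assumed to cover both negative and positive lags using $C_{i j}^{d} (τ) = C_{j i}^{d} (- τ)$.

```python
import math

def remove_daily_cycle(raw):
    """Filtered record: subtract, for each hour h, the mean over all days.

    `raw` is a list of days, each a list of 24 hourly values (assumed layout).
    Returns one flat hourly series.
    """
    n_days = len(raw)
    hourly_mean = [sum(day[h] for day in raw) / n_days for h in range(24)]
    return [value - hourly_mean[h] for day in raw for h, value in enumerate(day)]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (Eq. 1 at a fixed lag)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n
    std_a = math.sqrt(sum((u - mean_a) ** 2 for u in a) / n)
    std_b = math.sqrt(sum((v - mean_b) ** 2 for v in b) / n)
    return cov / (std_a * std_b)

def lagged_corr(x, y, lag):
    """C_ij(lag): correlate x(t) with y(t - lag); negative lags use C_ij(tau) = C_ji(-tau)."""
    if lag < 0:
        return lagged_corr(y, x, -lag)
    if lag == 0:
        return pearson(x, y)
    return pearson(x[lag:], y[:-lag])

def link_weight(x, y, max_lag):
    """Return (W_max, tau_max) as in Eq. 2, scanning lags in [-max_lag, max_lag]."""
    lags = list(range(-max_lag, max_lag + 1))
    corrs = [lagged_corr(x, y, lag) for lag in lags]
    mean_c = sum(corrs) / len(corrs)
    std_c = math.sqrt(sum((c - mean_c) ** 2 for c in corrs) / len(corrs))
    best = max(range(len(lags)), key=lambda k: corrs[k])
    return (corrs[best] - mean_c) / std_c, lags[best]
```

With hourly data, the 30-day maximum lag stated above would correspond to `max_lag = 720`. Under this sign convention, a series `y` that repeats `x` three steps later gives $τ_{\max} = -3$.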

3.2 Random Matrix Theory

A challenge in interpreting correlations among the $P M_{2.5}$ time series is that the exact nature of the interactions is unknown. RMT is an important tool in data analysis that is often used to extract underlying information from time series. It was developed, with minimal assumptions about the statistics of random Hamiltonians modeled as real symmetric matrices with independent random elements, to explain large amounts of spectroscopic data on the energy levels of complex quantum systems [35, 36]. The simplest way to determine correlations between the different time series was to use the equal-time cross-correlation matrix C, whose elements are given by Eq. 1 at $τ = 0$ [37, 38].

The statistical properties of matrix C were then determined following the standard RMT procedure. C was first diagonalized to obtain its eigenvalues λ. Next, the eigenvalue density $p (λ)$ was defined as follows.

$$p (λ) = \frac{1}{N} \frac{d n (λ)}{d λ} \tag{3}$$

where N is the number of nodes (as shown in Table 1) and $n (λ)$ is the number of eigenvalues of C that are less than λ. Following previous studies, we define $Q \equiv L_{d} / N$ , where $L_{d}$ is the length of a time series at daily resolution. For a correlation matrix built from purely random time series, RMT predicts that $p (λ)$ is given as follows.

$$p (λ) = \frac{Q}{2 π σ^{2}} \frac{\sqrt{(λ_{\max} - λ) (λ - λ_{\min})}}{λ} \tag{4}$$

where $λ_{\min} \leq λ \leq λ_{\max}$ and $σ^{2} = 1$ under this normalization. The bounds $λ_{\max}$ and $λ_{\min}$ were calculated as follows.

$$λ_{\min}^{\max} = σ^{2} \left(1 + 1 / Q \pm 2 \sqrt{1 / Q}\right) . \tag{5}$$

It should be noted that although RMT is a powerful method for identifying clues to the underlying interactions of a system, its parameter values differed slightly among the data sets. The basic parameters of the RMT used in this study are listed in Table 1, where $λ_{\max}$ and $λ_{\min}$ are the theoretical maximum and minimum eigenvalues from Eq. 5 and $λ_{\max}^{r e a l}$ is the maximum eigenvalue from the real data.
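As a hedged illustration of Eqs. 4 and 5, the sketch below evaluates the theoretical eigenvalue bounds and density for a given number of nodes and series length. The function names `mp_bounds` and `mp_density` are hypothetical, and the example values of N and $L_{d}$ in the usage note are made up for illustration; they are not the values from Table 1.

```python
import math

def mp_bounds(n_nodes, n_days, sigma2=1.0):
    """Eq. 5: theoretical eigenvalue bounds for a purely random correlation matrix."""
    q = n_days / n_nodes  # Q = L_d / N
    lam_min = sigma2 * (1 + 1 / q - 2 * math.sqrt(1 / q))
    lam_max = sigma2 * (1 + 1 / q + 2 * math.sqrt(1 / q))
    return lam_min, lam_max

def mp_density(lam, n_nodes, n_days, sigma2=1.0):
    """Eq. 4: theoretical eigenvalue density; zero outside [lam_min, lam_max]."""
    q = n_days / n_nodes
    lam_min, lam_max = mp_bounds(n_nodes, n_days, sigma2)
    if not lam_min <= lam <= lam_max:
        return 0.0
    root = math.sqrt((lam_max - lam) * (lam - lam_min))
    return q / (2 * math.pi * sigma2) * root / lam
```

For example, with the illustrative values N = 100 and $L_{d}$ = 400 (so Q = 4), `mp_bounds(100, 400)` gives the interval from 0.25 to 2.25. An empirical eigenvalue well above the upper bound, such as $λ_{\max}^{r e a l}$, is the kind of deviation from randomness that the analysis looks for.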

3.3 Convergent Cross-Mapping

Causality has been investigated in many studies, such as social, economic, climatology, and gene perturbation experiments [39–41]. Indeed, identifying causality in complex systems can be difficult but exciting in nature, and determining causal relationships is pertinent to many disciplines with broad applications. Traditionally, Granger causality analyses can be used as paradigmatic frameworks to determine such relations [39, 42]. However, Granger causality is linear and multivariate in nature and involves statistical regression, so various methods derived from such causality are required for extensive data [31]. Entropy based methods result in similar difficulties [43], but CCM [31] is based on non-linear time series analyses [44] and was developed to overcome these challenges. That is, CCM is powerful for detecting and quantifying causations between pairs of dynamic variables based on time series [31].

For this study, a phase space was reconstructed for each variable based on a delay-coordinate embedding method [44]. For example, for time series $x (t)$ , the reconstructed vector $X (t) = \{x (t), x (t - τ), \dots, x [t - (E_{x} - 1) τ]\}$ was used, where τ is a delay time and $E_{x}$ is an embedding dimension. The same can be done for time series $y (t)$ of variable y to yield a reconstructed vector $Y (t)$ in a space of dimension $E_{y}$ . The basic principle is to compare the predictions in each subspace. Consider the pair of vectors $[X (t), Y (t)]$ at time t, one vector from each subspace. In subspace Y, one can find a set of neighboring vectors for $Y (t)$ and identify the corresponding set in subspace X, which can be used to predict the value of $X (t)$ . The difference between $X (t)$ and its predicted value characterizes the accuracy of the prediction. Similarly, based on neighboring vectors in subspace X, a prediction in subspace Y can be made. Comparing the prediction accuracies in the two subspaces can determine the causal relationship between X and Y.

The principle underlying this method is asymmetry in directional predictability. Suppose one wishes to detect a causal interaction between two subsystems with the state variables $X (t)$ and $Y (t)$ , respectively. Using $X (t)$ , the value of $Y (t)$ can be predicted as $\hat{Y} (t)$ , and the correlation $ρ_{Y X} (t)$ between $Y (t)$ and $\hat{Y} (t)$ can be calculated. Similarly, using $Y (t)$ , a prediction $\hat{X} (t)$ can be obtained for $X (t)$ , and the correlation $ρ_{X Y} (t)$ between $X (t)$ and $\hat{X} (t)$ can be calculated. If no causal relationship exists between $X (t)$ and $Y (t)$ , the predictions in both directions are equally accurate, so the correlations $ρ_{Y X} (t)$ and $ρ_{X Y} (t)$ cannot be statistically distinguished from each other. That is, $Δ \equiv ρ_{X Y} (t) - ρ_{Y X} (t) = 0$ . However, if $X (t)$ is more a cause of $Y (t)$ than the opposite, the prediction of $X (t)$ obtained from $Y (t)$ is better than that of $Y (t)$ obtained from $X (t)$ , because information about $X (t)$ is contained in $Y (t)$ . Thus, $ρ_{X Y} (t)$ is greater than $ρ_{Y X} (t)$ , i.e., $Δ$ is greater than 0. Statistically positive $Δ$ values can therefore be considered a heuristic criterion for determining that the direction of the causal interaction is from $X (t)$ to $Y (t)$ . Likewise, a $Δ$ less than 0 indicates that $Y (t)$ is more a cause of $X (t)$ than the opposite [31].
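The cross-mapping idea described above can be illustrated with a simplified nearest-neighbor sketch. It is not the authors' implementation and omits the convergence test over increasing library length that gives CCM its name. The function names, the default `dim` and `tau`, the choice of `dim + 1` neighbors, and the exponential distance weighting are all assumptions made here for illustration.

```python
import math

def embed(series, dim, tau):
    """Delay vectors (s(t), s(t - tau), ..., s(t - (dim - 1) * tau)) for every valid t."""
    start = (dim - 1) * tau
    return [[series[t - k * tau] for k in range(dim)] for t in range(start, len(series))]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def cross_map_skill(target, source, dim=2, tau=1):
    """Correlation between `target` and its estimate from the delay space of `source`."""
    manifold = embed(source, dim, tau)
    offset = (dim - 1) * tau
    actual = target[offset:]  # target values aligned with the delay vectors
    predicted = []
    for i, point in enumerate(manifold):
        # Distances to every other delay vector; keep the dim + 1 nearest.
        nearest = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(manifold) if j != i
        )[: dim + 1]
        d_min = max(nearest[0][0], 1e-12)  # guard against division by zero
        weights = [math.exp(-d / d_min) for d, _ in nearest]
        total = sum(weights)
        predicted.append(sum(w * actual[j] for w, (_, j) in zip(weights, nearest)) / total)
    return pearson(actual, predicted)

def ccm_delta(x, y, dim=2, tau=1):
    """Delta = rho_XY - rho_YX; a positive value hints that x is more a cause of y."""
    rho_xy = cross_map_skill(x, y, dim, tau)  # x estimated from the delay space of y
    rho_yx = cross_map_skill(y, x, dim, tau)  # y estimated from the delay space of x
    return rho_xy - rho_yx
```

Calling `ccm_delta(x, y)` on two equal-length series returns the heuristic $Δ$ defined above. The brute-force neighbor search is quadratic in the series length, so this sketch is only practical for short series.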

3.4 Polar Coordinate System

Generally, the polar system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point and an angle from a reference direction. The reference point (analogous to the origin of a Cartesian coordinate system) is called the pole, and the ray from the pole in the reference direction is the polar axis. The distance from the pole is called the radial coordinate, radial distance, or simply radius, and the angle is called the angular coordinate, polar angle, or azimuth.

3.5 Distance

Orthodromic distance, the shortest distance between two points on the surface of a sphere, is measured along the surface of the sphere. For any two points i and j specified by ( $ϕ_{i}$ , $η_{i}$ ) and ( $ϕ_{j}$ , $η_{j}$ ), where ϕ is the geographical latitude and η is the geographical longitude, $Δ ϕ$ and $Δ η$ denote the absolute differences in latitude and longitude. The spherical law of cosines was then used to obtain the central angle $Δ σ$ between i and j as follows.

$$Δ σ = \arccos [\sin ϕ_{i} \cdot \sin ϕ_{j} + \cos ϕ_{i} \cdot \cos ϕ_{j} \cdot \cos (Δ η)] . \tag{6}$$

The distance was obtained using $d = r Δ σ$ , where r was the radius of the sphere.
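A minimal sketch of Eq. 6 and $d = r Δ σ$ follows. The function name `great_circle_km` is hypothetical, and the mean Earth radius of 6,371 km is an assumed value, since the text does not state which radius was used. Inputs are assumed to be in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the text

def great_circle_km(lat_i, lon_i, lat_j, lon_j, radius=EARTH_RADIUS_KM):
    """Orthodromic distance from the spherical law of cosines (Eq. 6)."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_eta = math.radians(abs(lon_j - lon_i))
    cos_sigma = (math.sin(phi_i) * math.sin(phi_j)
                 + math.cos(phi_i) * math.cos(phi_j) * math.cos(d_eta))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sigma = max(-1.0, min(1.0, cos_sigma))
    return radius * math.acos(cos_sigma)  # d = r * delta_sigma, angle in radians
```

The clamp matters for nearly coincident stations, where floating-point rounding can make the cosine marginally exceed 1.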

3.6 Azimuth

Azimuth, denoted as α, was defined as a horizontal angle measured clockwise from a north base line or meridian. For example, for reference point i with the latitude $ϕ_{i}$ and the longitude $η_{i}$ , the azimuth of point j ( $ϕ_{j}$ , $η_{j}$ ) was determined using the following equation [45].

$$α_{ij} = \arctan \left(\frac{\sin (η_{j} - η_{i})}{\cos ϕ_{i} \tan ϕ_{j} - \sin ϕ_{i} \cos (η_{j} - η_{i})}\right) . \tag{7}$$

As Eq. 7, when evaluated with the quadrant-aware two-argument arctangent, returns a value in the range ( $- 180^{\circ}$ , $180^{\circ}$ ], the result was normalized to a compass bearing in the range [ $0^{\circ}$ , $360^{\circ}$ ). The transformed formula was as follows:

$${\hat{α}}_{ij} = (α_{ij} + 360) \mathbin{\%} 360, \tag{8}$$

where $\%$ is the (floating point) modulo operation. Then, moving clockwise in a circle, the east, south, and west directions have azimuths of $90^{\circ}$ , $180^{\circ}$ , and $270^{\circ}$ , respectively.
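Eqs. 7 and 8 can be sketched as follows. The function name `azimuth_deg` is hypothetical, inputs are assumed to be in degrees, and the two-argument arctangent `math.atan2` is assumed in place of a plain arctangent so that the quadrant is recovered before the modulo step.

```python
import math

def azimuth_deg(lat_i, lon_i, lat_j, lon_j):
    """Compass bearing of point j seen from reference point i, in [0, 360)."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_eta = math.radians(lon_j - lon_i)
    numerator = math.sin(d_eta)
    denominator = (math.cos(phi_i) * math.tan(phi_j)
                   - math.sin(phi_i) * math.cos(d_eta))
    alpha = math.degrees(math.atan2(numerator, denominator))  # Eq. 7, in (-180, 180]
    return (alpha + 360.0) % 360.0  # Eq. 8
```

Seen from a reference point on the equator, a point due east at the same latitude gives a bearing of 90 degrees and a point due west gives 270 degrees, matching the convention stated above.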

4 Results

This section presents the results of the RMT, cross-correlation, and CCM analyses. First, we statistically analyze the distributions of the direction and distance between monitoring stations. Second, the RMT algorithm is used to assess the association between the monitoring stations. Then, the correlation between the stations is analyzed using the cross-correlation algorithm. Finally, we show how CCM causality between different sites changes with direction and distance.

4.1 Empirical Statistical Characteristics

4.1.1 Distribution of Azimuth α

As mentioned above, one of the main aims of this work was to study correlations and causalities between $P M_{2.5}$ monitoring station data sequences in different directions. Therefore, the distribution of directions between $P M_{2.5}$ monitoring sites is important. This section discusses the distribution of the azimuth α. As can be seen in Figures 1A–C, the azimuth distributions for the three data sets (global, China, and the United States) peaked at different values of α. Figure 1A shows four main peaks for α in the global data, the largest of which was at around $45^{\circ}$ ; that is, most neighbors were located to the northeast. Similarly, many neighbors were located to the east ( $\approx 90^{\circ}$ ), northeast ( $\approx 60^{\circ}$ ), and southwest ( $\approx - 110^{\circ}$ ).


FIGURE 1. The distributions of azimuth α and distance d. (A) shows the distribution of azimuth α for the global data set, (B) for the Chinese data set, and (C) for the American data set. (D) presents the distribution of distance d for the global stations, (E) for the Chinese stations, and (F) for the American stations.

Additionally, Figure 1B shows that neighboring cities in China were generally located to the northeast and southwest of a given city. Further, Figure 1C demonstrates that neighboring cities in the United States were generally located to the east and west. These results were in agreement with the distributions of urban belts globally, in China, and in the United States. It should, however, be noted that most of the $P M_{2.5}$ detection sites were located in urban, i.e., densely populated, areas; $P M_{2.5}$ monitoring in non-urban or sparsely populated areas is still needed. Nonetheless, this examination of the diversity of the α distribution provides an understanding of the influence of α on the associations, correlations, and causalities of $P M_{2.5}$ sequences. These results help to characterize the distribution of directions as a whole and to avoid biased conclusions caused by statistical differences between directions in the subsequent data analysis.

4.1.2 Distribution of Distance d

To study how correlations and causalities between $P M_{2.5}$ sites vary with the distances between the monitored locations, the distance distributions of the sites were needed. Figures 1D–F illustrate the distributions of distance d for the three data sets; the distributions peaked at different values of d. The distance distribution for the global data set had two peaks, while the distance distributions for the Chinese and American data sets had only one peak each. The two peaks in the global site distribution (Figure 1D) suggested that the monitoring sites were concentrated in two relatively distinct places. In contrast, the single-peak distributions for China and the United States (Figures 1E,F) showed that the monitoring sites were relatively concentrated and closely connected. Nevertheless, these distributions were not perfectly bimodal or normal. One reason for this is that the distributions of the monitoring sites conformed to non-uniform population distributions. These results help to characterize the distance distribution of the monitoring stations and to avoid biased conclusions caused by differences in the distance distribution in the subsequent data analysis.

4.2 Random Matrix Theory

RMT is helpful for comparing the properties of a null-hypothesis purely random matrix (built from strictly independent and identically distributed random time series) with those of the empirical correlation matrix C. Deviations from the purely random matrix can suggest the presence of underlying interactions [37, 38]. The RMT method was thus used to study the statistical properties of C with regard to the cross correlations of $P M_{2.5}$ changes. The elements of C were obtained from Eq. 1 with $τ = 0$ , in which case they reduce to Pearson’s correlation coefficient ρ between two time series. Figures 2A–C show the distributions of ρ for the global, Chinese, and American data sets, respectively. The means, $\bar{ρ}$ , and standard deviations, σ, of the three distributions were as follows: $\bar{ρ} \approx 0.04$ and $σ = 0.14$ for the global sites; $\bar{ρ} \approx 0.11$ and $σ = 0.11$ for the Chinese sites; and $\bar{ρ} \approx 0.04$ and $σ = 0.08$ for the American sites. A clear deviation could be seen between the distribution of ρ and the curve fit by the normal distribution. These results indicated that the $P M_{2.5}$ time series associations for the data were not completely random [37, 38], and the non-normal distributions suggested the existence of non-trivial relationships between detection sites. Future research should investigate the causes of these non-random associations, such as whether they are affected by climatic conditions.


FIGURE 2. The statistical properties of the cross correlation matrix C for the $P M_{2.5}$ time series. (A) presents the distributions of ρ ( $τ = 0$ ) for the global data, (B) presents the distributions for the Chinese data, and (C) presents the distributions for the American data, respectively. In (D–F), the green and red bars represent the probability distributions of the eigenvalues from the real data and the random series, respectively. The black dashed-dotted line is the theoretical result of Eq. 4.

As mentioned above, C was diagonalized to obtain its eigenvalues λ. The distributions of the eigenvalues from the empirical time series were then compared with those from finite, strictly independent and identically distributed random time series. Figures 2D–F show the distributions of the eigenvalues of the real (green bars) and random (red bars) time series for the global (D), Chinese (E), and American (F) data, with the length $L_{d}$ (see Table 1). As can be seen, there were dramatic differences between the random series and the real $P M_{2.5}$ time series. The results are qualitatively similar to those of earlier studies of the global crude oil market and the global stock market, which observed that the largest eigenvalue reflects the collective effect of the global market, the second to fifth largest eigenvalues can distinguish six clusters, and the smaller eigenvalues portray the time series pairs with the largest correlation coefficients [46, 47].

Moreover, RMT predicts that the eigenvalue distribution of a purely random correlation matrix should follow the black dash-dotted line, which deviates distinctly from the distribution of the real $P M_{2.5}$ time series but agrees well with that of the synthetic random series. The eigenvalue distribution of the empirical correlation matrix thus deviates from that of purely random time series, implying that the $P M_{2.5}$ time series are not purely random but carry a finite amount of association. These findings were consistent with the results obtained by examining the distribution of ρ, as shown in Figures 2A–C. However, this work represents only a preliminary attempt to identify the associations of real $P M_{2.5}$ time series. The actual relationship may be more complex, and the underlying mechanism of the associations is not within the scope of RMT. The non-random association between $P M_{2.5}$ sites suggests structured relationships between the sequences, for example, correlations that change with distance [17–21]. Furthermore, future research should consider time-lag cross-correlation RMT, because this method focuses on the magnitudes of the sequences and can therefore quantitatively mine the long-range collective movements hidden behind the short-range correlation features [48].

4.3 Cross-Correlation Analysis

4.3.1 Illustration of Cross-Correlation

To characterize the transport dynamics of $P M_{2.5}$ , this study aimed to determine the various hidden relationships between the time series measured at distinct nodes. A straightforward method was used: cross-correlation. It has been used in research on pollution transport to detect Rossby waves, teleconnection paths, and El Niño impacts [18–20]. Relationships may depend on distance, and as distance grows, the relationship values may follow specific patterns (e.g., monotonic decreases or increases, concentration at particular distances, and more). To understand how the relationships depend on the distance (radius) and angle (azimuth α) from a site (i.e., a pole or reference point), the polar coordinate system was employed (see Methods). For example, given the $P M_{2.5}$ time series recorded at a large number of monitoring stations (nodes) in a given geographical region, a distance and an azimuthal angle can be calculated for each nodal pair. The value of the correlation can then be represented by color in polar coordinates of distance and angle.

Three illustrative cases of how cross-correlation can depend on distance are shown in Figures 3A–C. Figure 3A is a representative case in which correlation decreases with distance, regardless of direction. Figure 3B shows correlations that are random with respect to both distance and direction, as would result from completely random time series. In contrast, Figure 3C shows a case in which the correlations between stations become stronger as the distance increases, again without differences with respect to direction. These panels thus illustrate the different types of relationships between distance and cross-correlation.


FIGURE 3. The correlations and polar coordinates for $P M_{2.5}$ cross correlations globally, in China, and in the United States. (A) shows that cross correlations decreased with distance, regardless of direction; correlation strength is represented by the color of the bar. (B) shows the cross correlation distributions for the random time series; a random relationship was present between cross correlation and distance. (C) shows that cross correlations increased with distance, regardless of direction; correlation strength is represented by the color of the bar. (D–F) show Pearson’s correlation coefficient ρ for all possible nodal pairs worldwide, in China, and in the United States, respectively; distance is plotted on a logarithmic scale. Finally, (G–I) show the cross correlation $W_{\max}$ for all possible nodal pairs worldwide, in China, and in the United States, respectively. Angles indicate directions between pairs of nodes, and radii indicate distances between pairs of nodes on a logarithmic scale. Dot colors show relationship values.

4.3.2 Pearson’s Correlation Coefficient ρ in Polar Coordinates

As mentioned above, one aim of this study was to detect how correlations between different sites vary with distance d and direction $\hat{α}$ (see Methods). Pearson’s correlation coefficient is a classic measure of the correlation between two sequences. Using this measure, the distribution of Pearson’s correlation coefficient ρ was obtained for the three data sets in the polar coordinate system $(d, \hat{α})$ . According to Figure 3D, the global data showed a clear anti-correlation between ρ and distance d; that is, the shorter the distance, the stronger the correlation. However, ρ did not differ significantly among the different directions $\hat{α}$ . Similar results were observed for both China and the United States, as shown in Figures 3E,F. However, the anti-correlation between ρ and d for the United States and China data was not as apparent as that in the global data, which may have been due to the relatively small number of observation stations in the United States and China. In the future, more sites and more detailed data are needed to study the relationships between ρ, $\hat{α}$ , and d. This result, obtained on a larger scale and with more source data, supports the inverse relationship between $P M_{2.5}$ correlations and distance found in previous studies [9].

4.3.3 Cross-Correlation of $W_{\max}$ in Polar Coordinates

In recent climate network models developed to study the spatiotemporal behavior of the climate system, nodes denote geographical coordinate sites, and a link between a pair of nodes is defined by the cross-correlation and mutual information between the time series from the two nodes [17–20]. Following this approach, one related type of cross-correlation, denoted as $W_{\max}$ (see Methods), was calculated: the so-called positive link weight between the $P M_{2.5}$ recordings of each pair of nodes, for the $L_{h}$ time series. Detailed information on the $L_{h}$ data is given in Table 1 and the Methods section.
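The exact definition of the link weight is given in the Methods section. Purely as an illustration, the Python sketch below assumes a convention that is common in the climate-network literature: the cross-correlation is evaluated over a window of time lags, and the positive link weight is the maximum of that function minus its mean, divided by its standard deviation. The function names, the lag window `max_lag`, and this normalization are assumptions and may differ in detail from the definition used in this study.

```python
import numpy as np

def lagged_correlations(x, y, max_lag):
    """Pearson correlation between x(t) and y(t + tau) for tau = -max_lag..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.empty(len(lags))
    for k, tau in enumerate(lags):
        if tau >= 0:
            a, b = x[: n - tau], y[tau:]      # y shifted forward by tau
        else:
            a, b = x[-tau:], y[: n + tau]     # x shifted forward by |tau|
        corr[k] = np.corrcoef(a, b)[0, 1]
    return lags, corr

def positive_link_weight(x, y, max_lag):
    """Assumed convention: (max C - mean C) / std C over the lag window.

    Returns the weight and the lag at which the correlation peaks.
    """
    lags, corr = lagged_correlations(x, y, max_lag)
    k = int(np.argmax(corr))
    weight = (corr[k] - corr.mean()) / corr.std()
    return weight, int(lags[k])
```

Under this convention, a large weight means that the correlation peak stands out clearly from the background of correlations at other lags, rather than merely being large in absolute value.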

As noted above, one goal of this study was to determine whether there were long-range correlations between the $P M_{2.5}$ recordings from different sites. To accomplish this goal, a polar representation of the cross-correlation was used. Specifically, for any nodal pair, the distance d and the azimuthal angle $\hat{α}$ were defined (e.g., an angle of $0^{\circ}$ meant that one node was exactly north of the other, and $90^{\circ}$ indicated that one node was east of the other). The correlation between the nodal pair was then color coded and represented in the polar coordinates $(d, \hat{α})$ .
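The polar coordinates of a nodal pair can be obtained from the station coordinates. The Python sketch below is one possible way to do this, assuming that d is the great-circle distance on a spherical Earth and that the azimuth is the initial bearing measured clockwise from north; the function name, the Earth radius constant, and the choice of the haversine formula are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def distance_and_azimuth(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) and initial bearing (degrees clockwise
    from north) from node 1 to node 2; inputs are in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)

    # Haversine formula for the central angle between the two nodes.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    distance = 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

    # Initial bearing: 0 degrees is north, 90 degrees is east.
    y = math.sin(dlam) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    return distance, azimuth
```

With this convention, a node directly north of the reference node has an azimuth of $0^{\circ}$ and a node directly east has an azimuth of $90^{\circ}$, matching the example above.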

In Figure 3G, the $W_{\max}$ values for the worldwide $P M_{2.5}$ data are represented by color-coded dots in polar coordinates. Larger values of the positive link weight $W_{\max}$ were noted at short distances (less than a few hundred km). Small $W_{\max}$ values were distributed approximately uniformly at larger distances and across directions. This phenomenon was in agreement with climate network results and is called the proximity (distance) effect [18–20]. Namely, pairs of sites close to each other (less than 2,000 km apart) were often strongly positively correlated [18, 21].

However, there were no long-range correlations for the $P M_{2.5}$ time series. Similar results were obtained for the Chinese and American $P M_{2.5}$ data, as shown in Figures 3H,I, respectively. A few points nonetheless had a large positive link weight $W_{\max}$ at long distances. The reasons for these outliers were not the focus of this paper, but future research should investigate these anomalies with more detailed data. The overall pattern suggested that $P M_{2.5}$ is generally not transmitted over long ranges. These results did not seem to depend on the length of the time series, provided that the length was reasonable, as shown in Figures 3G–I for the respective data sets. Moreover, the results indicated only short-range correlations and a lack of long-range correlations, which suggested that $P M_{2.5}$ could be transported across only short distances (less than a few hundred km) [9]. This result contrasts sharply with, for example, climate teleconnections [11, 12] and temperature records [18–20]. It indicates that $P M_{2.5}$ sequences differ from meteorological sequences such as temperature and show no teleconnections. Furthermore, our results show only the lack of long-range correlation between different sequences in space, whereas many previous studies using detrended fluctuation analysis (DFA), detrended cross-correlation analysis (DCCA), and multifractal detrended fluctuation analysis (MFDFA) have observed long-range correlations in the time dimension [49–51]. Therefore, future research should examine the correlations of $P M_{2.5}$ sequences in the time dimension in order to better understand $P M_{2.5}$ trends. Moreover, future research should focus more on the deeper causes of the different transmission phenomena, for example, differences in transmission media [52–55].

4.4 Causality Analysis

4.4.1 The Distribution of Causation $Δ$

Although previous research has reported non-trivial, distance-dependent correlations between monitoring stations, detecting causality between stations remains a problem of great theoretical significance and practical value [39–42]. In general, causation and correlation are not equivalent: two time series can be highly correlated without any causal relation [27]. This study therefore applied CCM to the $P M_{2.5}$ data to determine whether causal relationships exist between the time series from different geographical locations [31]. This method is suitable for nonlinear systems in the presence of noise [31, 56].
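To make the idea behind CCM concrete, the Python sketch below implements a minimal version of cross mapping: one series is delay-embedded into a shadow manifold, and the other series is estimated from the nearest neighbors on that manifold using exponentially decaying weights. It is a simplified illustration and not the implementation used in this study. The embedding dimension `dim`, the delay `tau`, and all function names are assumptions, and `causation_delta` takes the difference between the two directional skills only as a stand-in for the quantity $Δ$, whose actual definition is given in the Methods section.

```python
import numpy as np

def delay_embed(series, dim, tau):
    """Rows are delay vectors [s(t), s(t - tau), ..., s(t - (dim - 1) * tau)]."""
    series = np.asarray(series, dtype=float)
    start = (dim - 1) * tau
    cols = [series[start - k * tau: len(series) - k * tau] for k in range(dim)]
    return np.column_stack(cols)

def cross_map_skill(source, target, dim=2, tau=1):
    """Correlation between `source` and its estimate from the shadow manifold
    of `target`. In CCM, high skill is read as evidence that `source`
    influences `target`."""
    manifold = delay_embed(target, dim, tau)
    truth = np.asarray(source, dtype=float)[(dim - 1) * tau:]
    # Pairwise distances between all points on the shadow manifold.
    dist = np.linalg.norm(manifold[:, None, :] - manifold[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    estimate = np.empty(len(manifold))
    for i in range(len(manifold)):
        nbrs = np.argsort(dist[i])[: dim + 1]   # dim + 1 nearest neighbors
        d = dist[i, nbrs]
        w = np.exp(-d / max(d[0], 1e-12))       # weights decay with distance
        estimate[i] = np.sum(w * truth[nbrs]) / np.sum(w)
    return np.corrcoef(estimate, truth)[0, 1]

def causation_delta(x, y, dim=2, tau=1):
    """Hypothetical stand-in for Delta: skill(x -> y) minus skill(y -> x)."""
    return cross_map_skill(x, y, dim, tau) - cross_map_skill(y, x, dim, tau)
```

A full CCM analysis would also check that the cross-map skill converges as the length of the time series used to build the manifold increases; this sketch evaluates the skill at a single library length only.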

In particular, for a nodal pair, a non-zero $Δ$ value could indicate a causal relationship between the two $P M_{2.5}$ time series. First, the relative strength of causation $Δ$ was computed (see Methods) for the three data sets; representative distributions are shown in Figures 4A–C. The $Δ$ distributions of the global, Chinese, and American data were each well fit by a normal distribution. The means $\bar{Δ}$ and standard deviations $σ$ of the three distributions were as follows: $\bar{Δ} \approx 0.00$ and $σ = 0.02$ for the global data; $\bar{Δ} \approx 0.00$ and $σ = 0.02$ for the Chinese data; and $\bar{Δ} \approx 0.00$ and $σ = 0.02$ for the American data. These narrow normal distributions centered on zero suggested that there may be no significant causality between the $P M_{2.5}$ sequences at the collective level. Although no clear causality was observed at the collective level, this is a useful exploration of causal effects in the $P M_{2.5}$ sequence data. Causality detection is expected to play an important role in future research on climate and environmental policies [28–30].
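The comparison between an empirical $Δ$ distribution and a normal curve can be sketched as follows. This Python example assumes that the normal fit is obtained from the sample mean and sample standard deviation, which may not be the fitting procedure used for Figures 4A–C; the function name and the number of histogram bins are likewise illustrative.

```python
import numpy as np

def summarize_delta(delta, bins=50):
    """Sample mean and standard deviation of the Delta values, plus the
    histogram density and the matching normal density at the bin centers."""
    delta = np.asarray(delta, dtype=float)
    mean = delta.mean()
    std = delta.std(ddof=1)                      # sample standard deviation
    density, edges = np.histogram(delta, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Normal probability density with the estimated mean and std.
    fitted = (np.exp(-0.5 * ((centers - mean) / std) ** 2)
              / (std * np.sqrt(2.0 * np.pi)))
    return mean, std, centers, density, fitted
```

Plotting `density` as bars and `fitted` as a dotted line over `centers` gives a figure of the same type as Figures 4A–C.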

FIGURE 4. The causality analysis of the $P M_{2.5}$ time series. (A–C) show the $Δ$ distributions for the global, Chinese, and American stations, respectively. The bar graph shows the distribution of the real data, and the black dotted line shows the fitted normal curve. (D–F) show $Δ$ against azimuth α for the global, Chinese, and American data, respectively. Finally, (G–I) show $Δ$ against distance d for the global, Chinese, and American data, respectively. The violin plots present the overall distribution of each data group. The upper and lower ends are the top and the bottom of each violin’s distribution, and the middle lines are the means of each violin’s distribution. The gray dots are a scatterplot of the original data.

4.4.2 Causation $Δ$ in Comparison to Azimuth α

To better understand the relationship between causation $Δ$ and direction α between $P M_{2.5}$ monitoring sites, we examined how causality varies with the direction between sites. In all three data sets, causality is approximately symmetrically distributed across directions. In particular, Figure 4D shows that the mean value of $Δ$ followed the straight line $Δ = 0$ . This result indicated that, at the collective level, causality between the different $P M_{2.5}$ stations was unrelated to direction. The dense distribution in some directions, e.g., $α \approx 45^{\circ}$ in Figure 4D, was consistent with the distribution characteristics in Figure 1A. Additionally, Figures 4E,F indicated similar relationships for China and the United States. It was thus concluded that there is no apparent causal trend among the different sites in the different directions; that is, one cannot judge the causality between $P M_{2.5}$ observation sites based on their direction alone.

4.4.3 Causation $Δ$ in Comparison to Distance d

As noted above, a non-trivial correlation was present between the monitoring sites as distance d changed. This study thus investigated how causality between the monitoring sites varies with distance d. Figure 4G revealed that, as distance d changed, causation $Δ$ was almost evenly distributed on both sides of the central $Δ = 0$ axis. This result indicated that, at the collective level, there was no relation between causality and distance d. A similar result was observed for China (Figure 4H) and the United States (Figure 4I). Although no consistent causality was observed at the collective level, a relative causality seemed likely at the individual level. Future research should therefore investigate causality at the individual level, particularly the impact of $P M_{2.5}$ sequence length. In short, no definite conclusion can be drawn about the relationship between causality and distance for $P M_{2.5}$ monitoring stations, and stations should be studied separately according to their different situations.
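One simple way to check whether the mean of $Δ$ stays close to zero as distance (or direction) changes is to group the nodal pairs into bins and average $Δ$ within each bin, which is broadly what the middle lines of the violin plots in Figure 4 summarize. The Python sketch below illustrates this; the function name and the bin edges are hypothetical, and the grouping used for Figure 4 may differ.

```python
import numpy as np

def binned_mean(values, covariate, edges):
    """Mean of `values` within bins of `covariate`.

    Bin k covers [edges[k], edges[k + 1]); pairs outside all bins are
    ignored and empty bins are reported as NaN.
    """
    values = np.asarray(values, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    edges = np.asarray(edges, dtype=float)
    idx = np.digitize(covariate, edges) - 1      # bin index of each pair
    means = np.full(len(edges) - 1, np.nan)
    for k in range(len(edges) - 1):
        in_bin = idx == k
        if in_bin.any():
            means[k] = values[in_bin].mean()
    return means
```

For distance, logarithmically spaced edges such as `np.logspace(1, 4, 7)` (10 km to 10,000 km, an arbitrary illustrative range) match the logarithmic distance axis; for azimuth, evenly spaced edges from 0 to 360 degrees would be the natural choice.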

5 Conclusion

With the development of data collection technology, a growing number of studies have focused on $P M_{2.5}$ sequences. Some examine the impact of $P M_{2.5}$ on climate and meteorology in particular places and periods, and others examine the relationship between $P M_{2.5}$ correlations and distance. However, these studies rarely address the various relationships of $P M_{2.5}$ over a large spatiotemporal range, let alone the causal effects between $P M_{2.5}$ sequences. We have conducted a study of the relations among $P M_{2.5}$ time series based purely on massive data. Our statistical and nonlinear analysis of worldwide $P M_{2.5}$ time series over the period of one year leads to a number of findings. First, a random-matrix-based analysis indicates that the spatiotemporal $P M_{2.5}$ data are not purely random but are associated. Second, correlation among the $P M_{2.5}$ time series exists over short distances, which is consistent with the first finding. However, there is a lack of consistent long-range cross-correlation, suggesting that transport of $P M_{2.5}$ over long distances is complex and changeable. Third, since correlation does not imply causation in general, a causality analysis is necessary to assess the likelihood of long-range transport, which we carry out by employing the nonlinear-dynamics-based CCM method. The analysis reveals no consistent indication of transport of $P M_{2.5}$ over long distances (e.g., over 1,000 km). The simultaneous absence of consistent long-range correlation and statistical causation leads to the conclusion that transport of $P M_{2.5}$ over long distances is complex and varied.

It should be noted that these analyses were only statistical. Nevertheless, for numerous pairs across a significant distance, a lack of consistent and definitive causal relations was found (as shown in Figures 4D–I). That is, over the given distance, the direction of causal interaction appeared completely random, and this implied a lack of causation in general. These findings extended prior work on $P M_{2.5}$ series to a large spatiotemporal scale with causality analysis [30]. Moreover, an absence of consistent long-range transport was found for three different data sets, with both cross-correlation analyses and causality detection methods. This finding is promising; however, it should be explored using other association and causal analyses as well as different data sets.

Generally, $P M_{2.5}$ pollution affects only a short distance (no more than several hundred km); policies should be adapted to address this. However, while this study was an empirical analysis of $P M_{2.5}$ time series collected directly from various monitoring stations, it was only a preliminary attempt to identify the associations, correlations, and causal effects in real $P M_{2.5}$ time series. Actual relationships may be more complex, and the underlying mechanisms that cause these relationships were not within the scope of this study. Our results demonstrate that random matrix theory can also play an important role in the analysis of $P M_{2.5}$ sequences. Future research should pay more attention to the specific meaning of the eigenvalues and their corresponding eigenvectors [46, 47]. Meanwhile, research on magnitudes based on time-lag cross-correlation RMT is also a very promising direction [48]. In addition, future research should focus on the correlation of $P M_{2.5}$ in the time dimension [49–51]. Moreover, this analysis did not investigate other pollutants, such as $C O_{X}$ , $N O_{X}$ , and $S O_{X}$ [10], nor did it consider meteorological variables, such as temperature, relative humidity, precipitation, cloud cover, wind speed, and wind direction [57]. Additionally, it did not draw on mineralogical composition analyses [52–55], Total Ozone Mapping Spectrometer data [54], or climate models and biogeochemical interactions [8]. Further research should investigate these topics and examine the role of $P M_{2.5}$ in the spread of disease, especially given the recent impact of the coronavirus (COVID-19) on the world [58]. Nonetheless, this research provides a useful basis for such future research.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: http://berkeleyearth.org/data/, https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/24826, https://www.epa.gov/outdoor-air-quality-data.

Author Contributions

Devised the research project: Z-DZ, NZ, and NY; Performed numerical simulations: Z-DZ; Analyzed the results: Z-DZ, NY, and NZ; Wrote the paper: Z-DZ.

Funding

This work is partially supported by the Scientific Research Foundation of Shantou University (Grant No. NTF19015), the 2020 Li Ka Shing Foundation Cross-Disciplinary Research (Grant No. 2020LKSFG09D), the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2021A1515012294), the National Key Research and Development Program of China (Grant No. 2016YFA0602503), and the National Natural Science Foundation of China (Grant No. 62066048). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Buseck PR, Posfai M. Airborne Minerals and Related Aerosol Particles: Effects on Climate and the Environment. Proc Natl Acad Sci (1999) 96(7):3372–9. doi:10.1073/pnas.96.7.3372

2. Samset BH. How Cleaner Air Changes the Climate. Science (2018) 360(6385):148–50. doi:10.1126/science.aat1723