- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
In the scientific literature, data analysis results are often presented in which samples from different experiments or conditions, technical replicates, or time series have been merged to increase the sample size before calculating the correlation coefficient. This way of proceeding violates two basic assumptions underlying the use of the correlation coefficient: sampling from one population and independence of the observations (independence of errors). Since correlations are used to measure and infer associations between biological entities, this has tremendous implications for the reliability of scientific results, as the violation of these assumptions leads to wrong and biased results. In this technical note, I review some basic properties of the Pearson’s correlation coefficient and illustrate some exemplary problems with simulated and experimental data, taking a didactic approach with the use of supporting graphical examples.
1 Introduction
The Pearson’s correlation coefficient (Pearson, 1895; Spearman, 1907) is certainly one of the most popular measures of association used in biology and in the Life Sciences. Unfortunately, it is also one of the most misused. Recently, several papers have been brought to my attention by collaborators in which the sample correlation coefficient is calculated following questionable practices, in particular when data from different experiments or conditions are combined before calculating the correlation. The goal of this cautionary note is to show what happens when the basic assumptions underlying the calculation and the use of the sample correlation are not met.
To set the scene, I start by introducing some notation and by recalling some basic statistical principles. Given n observations (x1, x2, … , xn) of a variable x and n observations (y1, y2, … , yn) of a variable y, the Pearson’s sample correlation coefficient r(xy) (which I will simply call “sample correlation” for the sake of simplicity) is defined as

$$r^{(xy)} = \frac{C^{(xy)}}{\sqrt{V^{(x)}\,V^{(y)}}} \tag{1}$$

where V(x) is the sample variance¹

$$V^{(x)} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

and C(xy) is the sample covariance:

$$C^{(xy)} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$

with

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
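As a concrete illustration of Eq. 1, the sample correlation can be computed directly from these definitions; the minimal Python sketch below (using arbitrary simulated values, introduced here only for illustration) does so and checks the result against numpy.corrcoef.

```python
import numpy as np

def sample_correlation(x, y):
    """Pearson's sample correlation r^(xy) computed from Eq. 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    # Sample covariance C^(xy) and sample variances V^(x), V^(y) (denominator n - 1)
    c_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)
    v_x = np.sum((x - x_bar) ** 2) / (n - 1)
    v_y = np.sum((y - y_bar) ** 2) / (n - 1)
    return c_xy / np.sqrt(v_x * v_y)

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
print(sample_correlation(x, y))   # computed from Eq. 1
print(np.corrcoef(x, y)[0, 1])    # same value from numpy
```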
Let’s now recall some facts about the sample correlation r(xy) (Eq. 1): it can always be calculated; however, its validity and (correct) interpretation, including its statistical significance as expressed by the associated p-value, rest on several statistical assumptions. I will focus here on the two assumptions stating that the observations used to calculate the correlation must be identically and independently distributed; these can be formulated as follows.
A1 All the (x1, x2, … , xn) observations of x (and y) are sampled from the same (normal) distribution.
A2 The (x1, x2, … , xn) observations of x (and y) are independent (independence of errors).
To be of any significance, and to be able to make inference on population parameters, the observations must be randomly selected, that is, they must be a representative random sample of a larger population. In addition, the relationship between x and y must be linear, since (Eq. 1) cannot account for non-linear relationships. Chapter 32 of Motulsky (2014) offers an accessible yet very precise presentation of all the assumptions underlying the use of the correlation coefficient.
This note deals with the problems arising when assumptions A1 and A2 are violated, that is, when the observations are not identically and not independently distributed. The consequences of the violation of other assumptions, like deviations from normality of the observations, have been discussed elsewhere (Calkins, 1974; Havlicek and Peterson, 1976; Havlicek and Peterson, 1977; Wilcox, 2009). The papers by Schober et al. (2018) and Janse et al. (2021) discuss further pitfalls and interpretative problems of correlations.
In what follows, all observations are well-behaved and follow a normal (Gaussian) distribution:

$$x_i \sim \mathcal{N}(\mu, \sigma)$$

where μ and σ are the population mean and standard deviation (the same holds for variable y).
2 Three research scenarios when correlations are wrongly calculated
I will consider three research scenarios, often found in the literature, in which data are manipulated in some way before the correlation among the measured variables is calculated.
RS1 A researcher has two data sets A and B which contain measurements of the variables x and y. Data set A contains n observations of x and y, while data set B contains m observations. A and B are combined into one data set containing n + m observations and the Pearson’s sample correlation (Eq. 1) is calculated between x and y. The reasons for merging the two data sets can be different. Data sets A and B can come from different batches measured during an experiment, or even from different experiments. Often the data sets are merged to increase the sample size with the idea (wrong, in this case) of obtaining a more reliable estimate of the correlation between x and y.
RS2 A researcher has performed an experiment where a large number of variables have been measured over two conditions, on n observations for Condition A and m observations for Condition B, as, for example, in the case of a transcriptomics experiment where the expression of thousands of genes has been measured on case and control samples. To reduce the dimensionality of the problem, the researcher restricts the analysis to those genes that are differentially expressed between the two conditions. Motivated by the interest in understanding regulatory mechanisms, the researcher decides to build a correlation network using all the measurements (observations) available for those genes (let us call two such genes x and y) that are differentially expressed.
RS3 A researcher has measured variables x and y several times on the same subjects, obtaining m measurements for each subject. A typical case is when technical replicates are measured for each sample (or for some of the samples), usually in duplicate or triplicate, or when time series are acquired. To increase the total sample size, they then decide to combine all the n × m observations and calculate the correlation over the n × m observations of x and y.
The first two scenarios RS1 and RS2, albeit different, can be schematized in the same way, as shown in Figure 1.
FIGURE 1. Research scenarios RS1 and RS2. Graphical illustration of two data sets A and B (depicted as blocks of different color) containing measurements/observations of two variables x and y measured on two different conditions. Data set A contains n observations, while data set B contains m observations. The data set obtained by merging A and B row-wise (on top of the other) has dimension (n + m) × 2.
2.1 Violation of sampling from one population
Scenarios RS1 and RS2 entail the calculation of sample correlations after merging two (or more) data sets: I wish to illustrate, with a simple simulated example, the problems that arise from operating in this way. Figure 2A shows the correlation plot of n + m = 200 observations of variables x and y. A positive linear relationship seems to exist between x and y: the sample correlation coefficient is
FIGURE 2. Research scenarios RS1 and RS2. (A) Scatter plot of n + m = 200 observations of two variables x and y: the Pearson’s sample correlation between x and y is
The problem becomes evident when we look at how the data presented in Figure 2B has been built. I have proceeded as described in research scenarios RS1 and RS2, merging two data sets containing n = 100 and m = 100 observations of x and y, and then I have calculated the correlation between x and y. The reality is that x and y are not correlated at all: when the two data sets (conditions) are considered separately, the correlation between x and y is zero, since within each data set the observations of x and y shown in Figure 2A were generated as independent draws from a normal distribution, with a different mean in data set A than in data set B. This generating mechanism produces variables x and y that are independent and uncorrelated. The second scenario RS2 is very often encountered in papers dealing with the analysis of very large omics data sets. This way of proceeding is also problematic. In fact, when the researcher looks for differentially expressed genes (or for metabolites with different concentrations), they perform some statistical test to compare the observed means of variable x in condition A versus condition B (in this case a t-test, for instance), testing the null hypothesis (similar considerations hold for variable y)

$$H_0 : \mu_A^{(x)} = \mu_B^{(x)}$$

against the alternative

$$H_1 : \mu_A^{(x)} \neq \mu_B^{(x)}.$$
There is of course no problem in using the t-test to find genes that are differentially expressed (even if more powerful approaches have been introduced for this type of data). The problem arises when the differentially expressed genes are used to compute correlations. Selecting the variables for which H0 is rejected is equivalent to selecting the variables for which the distribution of x differs between the two conditions. Stated in other words, by doing so the researcher is looking specifically for those variables that violate the assumption of sampling from one distribution!
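The mechanism behind RS1 and RS2 is easy to reproduce in simulation. The following minimal Python sketch is an illustration only (the means of 40 and 44 are arbitrary choices used to mimic a shift between conditions, not necessarily the values used for Figure 2): x and y are generated independently within each condition, yet the correlation computed on the merged data is far from zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 100, 100

# Condition A: x and y drawn independently (true correlation = 0)
x_a = rng.normal(loc=40.0, scale=1.0, size=n)
y_a = rng.normal(loc=40.0, scale=1.0, size=n)

# Condition B: again independent, but with a shifted mean
# (e.g., a "differentially expressed" gene)
x_b = rng.normal(loc=44.0, scale=1.0, size=m)
y_b = rng.normal(loc=44.0, scale=1.0, size=m)

print("r within A:", np.corrcoef(x_a, y_a)[0, 1])   # close to 0
print("r within B:", np.corrcoef(x_b, y_b)[0, 1])   # close to 0
print("r on merged data:",
      np.corrcoef(np.concatenate([x_a, x_b]),
                  np.concatenate([y_a, y_b]))[0, 1])  # strongly positive
```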
If the reader thinks that these are just simulated numerical examples, it is not complicated to show that such problematic situations can be easily encountered when using real-life experimental data. Figure 3 shows a case similar to the one given in Figure 2, this time obtained using data from a transcriptomic study: the expression profiles of two genes x and y, measured at two different conditions in a case-control scenario, are uncorrelated (
FIGURE 3. Real life research scenario RS2. (A) Scatter plot of the expression values of two genes x and y: the Pearson’s sample correlation between gene x and y is
2.2 A closer look at the mathematics of the problem
This section presents a mathematical explanation of what is observed in Figures 2, 3. Let’s define the overall mean

$$\bar{x} = \frac{1}{n+m}\sum_{i=1}^{n+m} x_i$$

and the partial means

$$\bar{x}_A = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \bar{x}_B = \frac{1}{m}\sum_{i=n+1}^{n+m} x_i,$$

with similar definitions for variable y. The overall variance is

$$V^{(x)} = \frac{1}{n+m-1}\sum_{i=1}^{n+m}\left(x_i - \bar{x}\right)^2$$

and the partial variances are

$$V_A^{(x)} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}_A\right)^2 \qquad \text{and} \qquad V_B^{(x)} = \frac{1}{m-1}\sum_{i=n+1}^{n+m}\left(x_i - \bar{x}_B\right)^2.$$

The total variance of x (and, analogously, of y) can be decomposed into the partial variances plus a term depending on the difference between the partial means (see, e.g., Chan et al., 1982; Hayes, 2012):

$$(n+m-1)\,V^{(x)} = (n-1)\,V_A^{(x)} + (m-1)\,V_B^{(x)} + \frac{nm}{n+m}\left(\bar{x}_A - \bar{x}_B\right)^2 \tag{13}$$

$$(n+m-1)\,V^{(y)} = (n-1)\,V_A^{(y)} + (m-1)\,V_B^{(y)} + \frac{nm}{n+m}\left(\bar{y}_A - \bar{y}_B\right)^2 \tag{14}$$

The covariance can be decomposed in the same fashion:

$$(n+m-1)\,C^{(xy)} = (n-1)\,C_A^{(xy)} + (m-1)\,C_B^{(xy)} + \frac{nm}{n+m}\left(\bar{x}_A - \bar{x}_B\right)\left(\bar{y}_A - \bar{y}_B\right) \tag{15}$$

By combining (Eq. 15) with (Eqs 13, 14) it is possible to obtain the sample correlation calculated over the merged data set:

$$r^{(xy)} = \frac{(n-1)\,C_A^{(xy)} + (m-1)\,C_B^{(xy)} + \frac{nm}{n+m}\left(\bar{x}_A - \bar{x}_B\right)\left(\bar{y}_A - \bar{y}_B\right)}{\sqrt{\left[(n-1)\,V_A^{(x)} + (m-1)\,V_B^{(x)} + \frac{nm}{n+m}\left(\bar{x}_A - \bar{x}_B\right)^2\right]\left[(n-1)\,V_A^{(y)} + (m-1)\,V_B^{(y)} + \frac{nm}{n+m}\left(\bar{y}_A - \bar{y}_B\right)^2\right]}} \tag{16}$$

Eq. 16 shows that the correlation between x and y taken over the full data set is a weighted sum of the correlations $r_A^{(xy)}$ and $r_B^{(xy)}$ calculated within the two data sets (since $C_A^{(xy)} = r_A^{(xy)}\sqrt{V_A^{(x)} V_A^{(y)}}$ and $C_B^{(xy)} = r_B^{(xy)}\sqrt{V_B^{(x)} V_B^{(y)}}$), plus a contribution that depends only on the differences between the partial means.

The term $\frac{nm}{n+m}\left(\bar{x}_A - \bar{x}_B\right)\left(\bar{y}_A - \bar{y}_B\right)$ does not depend on the association between x and y within the two data sets: if the means of x and the means of y differ between A and B, this term can dominate the numerator (and the corresponding terms the denominator), driving $r^{(xy)}$ towards ±1 even when $r_A^{(xy)} = r_B^{(xy)} = 0$.
This behaviour is consistent with the numerical results obtained for the synthetic data sets, where the observations of x and y were generated independently (and hence uncorrelated) within each of the two data sets, with different means in data set A and in data set B and with n = m = 100. Eq. 16 also shows that the contrary is possible: two variables can be correlated within two different data sets but uncorrelated when the correlation is taken over the two merged data sets: this is shown in Figure 4.
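The decomposition in Eq. 16 can also be checked numerically. The short Python sketch below is a minimal illustration of this check, using arbitrary simulated values (two uncorrelated data sets with a mean shift, analogous to the synthetic example above); it computes the correlation on the merged data both directly and through the partial statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 100
x_a, y_a = rng.normal(40, 1, n), rng.normal(40, 1, n)   # data set A: uncorrelated
x_b, y_b = rng.normal(43, 1, m), rng.normal(43, 1, m)   # data set B: uncorrelated, shifted mean

def cov(u, v):
    # Sample covariance with denominator n - 1
    return np.cov(u, v, ddof=1)[0, 1]

# Numerator and denominator of Eq. 16, built from the partial statistics
num = ((n - 1) * cov(x_a, y_a) + (m - 1) * cov(x_b, y_b)
       + n * m / (n + m) * (x_a.mean() - x_b.mean()) * (y_a.mean() - y_b.mean()))
den_x = ((n - 1) * x_a.var(ddof=1) + (m - 1) * x_b.var(ddof=1)
         + n * m / (n + m) * (x_a.mean() - x_b.mean()) ** 2)
den_y = ((n - 1) * y_a.var(ddof=1) + (m - 1) * y_b.var(ddof=1)
         + n * m / (n + m) * (y_a.mean() - y_b.mean()) ** 2)
r_eq16 = num / np.sqrt(den_x * den_y)

# Direct computation on the merged data
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
r_direct = np.corrcoef(x, y)[0, 1]

print(r_eq16, r_direct)   # identical up to floating-point error
```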
FIGURE 4. Research scenarios RS1 and RS2. (A) Scatter plot of n + m = 200 observations of two variables x and y: the Pearson’s sample correlation between x and y is
Eq. 16 also explains why the Pearson’s sample correlation is so sensitive to outliers, to the point that one single outlier is sufficient to pull a zero correlation to 1. Having one outlier is equivalent to having one additional data set (or condition/class/group) with just m = 1 observation, so that $\bar{x}_B = x_{n+1}$, $\bar{y}_B = y_{n+1}$ and the partial variance and covariance terms of B vanish. As a result, Eq. 16 simplifies to

$$r^{(xy)} = \frac{(n-1)\,C_A^{(xy)} + \frac{n}{n+1}\left(\bar{x}_A - x_{n+1}\right)\left(\bar{y}_A - y_{n+1}\right)}{\sqrt{\left[(n-1)\,V_A^{(x)} + \frac{n}{n+1}\left(\bar{x}_A - x_{n+1}\right)^2\right]\left[(n-1)\,V_A^{(y)} + \frac{n}{n+1}\left(\bar{y}_A - y_{n+1}\right)^2\right]}}$$
If the outlying observation (xn+1, yn+1) is very distant from the average of the other n observations, the resulting correlation can be severely inflated, as shown in Figure 5A. The reader could argue that the use of Spearman’s rank correlation (Spearman, 1904) would avoid this problem, since Spearman’s correlation is less sensitive to outliers. This is certainly true, as shown in Figure 5B: the presence of a single outlier, even one lying very far from the bulk of the other n non-outlying observations, does not affect the correlation. However, this argument holds only in the case of a few outliers: if the number m of outliers increases, the Spearman’s correlation also increases, albeit less dramatically than the Pearson’s correlation, but still to a significant extent, leading one to claim the existence of a correlation between x and y when x and y are not correlated at all. This is shown in Figures 6A, B. The agreement between the Pearson’s and Spearman’s indexes increases as the number of outliers increases, since the Spearman’s index is not robust against a large number of outliers: with one or two outliers the Pearson’s coefficient gives a large correlation while the Spearman’s does not, and as the number of outliers increases both agree on a large correlation.
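This behaviour is easy to reproduce numerically. The following Python sketch is a minimal illustration (the outlier location and the sequence of m values are arbitrary choices, not those used for Figures 5, 6): it adds an increasing number of outlying observations to uncorrelated data and tracks both coefficients.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(7)
n = 100
x = rng.normal(40, 1, n)   # uncorrelated "clean" observations
y = rng.normal(40, 1, n)

for m in [0, 1, 2, 5, 20, 100]:
    # Add m outlying observations far from the bulk of the data
    x_out = np.concatenate([x, rng.normal(80, 1, m)])
    y_out = np.concatenate([y, rng.normal(80, 1, m)])
    r_p = pearsonr(x_out, y_out)[0]
    r_s = spearmanr(x_out, y_out)[0]
    print(f"m = {m:3d}  Pearson = {r_p:5.2f}  Spearman = {r_s:5.2f}")
```

With a single extreme outlier the Pearson coefficient jumps towards 1 while the Spearman coefficient stays near zero; as m grows, both coefficients report a spuriously large correlation.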
FIGURE 5. Effect of outliers on the correlation coefficient: 100 observations of two uncorrelated variables x and y ∼ N (40, 1) (observed Pearson’s correlation
FIGURE 6. (A) Effect of the number of outliers on the Pearson’s and Spearman’s correlation. Both indexes increase with the number m of outliers. (B) Scatter plot of the Spearman’s index against the Pearson’s index in the presence of m = 1 outlier (black dots) and in the presence of m = 100 outliers (purple dots). The difference between the two indexes is larger when m = 1, since the Spearman’s correlation is robust to the presence of that single outlier. If the number of outliers is large (m = 100), the difference between the two indexes becomes increasingly smaller, since the Spearman’s index is also affected by the outliers: both indexes record an inflated correlation.
2.3 Violation of independence of observations
The third scenario RS3 pertains to the violation of the assumption that the observations (i.e., the samples on which variables x and y are measured) are independent, i.e., independence of errors. Working with repeated measures is the most striking case of non-independent observations, since the same subject is measured multiple times. This scenario is graphically illustrated in Figure 7.
FIGURE 7. Graphical illustration of a data set containing m repeated (or replicated) measurements/observations of two variables x and y on n subjects. Each block of repeated measurements is depicted as a block of different color.
What are the consequences of the violation of the independence of the samples? This depends on the type of dependence present between the observations. Take for instance the case in which technical replicates are used (wrongly!) to increase the sample size. In this case several observations (measurements) of x and y are available on the same sample(s), following the model (a similar equation holds for y)

$$x_{i,j} = x_i + \epsilon_j$$

where x_{i,j} is the jth replicate of the ith observation of x and ε_j is the replication error. An example of n = 25 observations of x and y, each with three replicates, is shown in Figure 8A: if only one observation per subject is taken (n = 25), the correlation between x and y is r(xy) = 0.91, while if all n + m = 25 + 2 × 25 = 25 × 3 = 75 observations are taken, the correlation is r(xy) = 0.76. In general, using replicates and treating them as independent observations will lower the value of the correlation coefficient. Repeated measures must be handled carefully with special approaches: there is ample literature on this topic, see for instance Bakdash and Marusich (2017) and references therein.
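To see how treating technical replicates as independent observations attenuates the estimate, one can simulate correlated subject-level values and add independent replicate noise; the Python sketch below is an illustrative example only (the subject-level correlation, the replicate noise level and the seed are arbitrary choices, not the values used for Figure 8).

```python
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_reps = 25, 3

# Subject-level values of x and y, correlated by construction
x_subj = rng.normal(40, 1, n_subjects)
y_subj = 0.9 * x_subj + rng.normal(0, 0.5, n_subjects)
print("r using one observation per subject:",
      np.corrcoef(x_subj, y_subj)[0, 1])

# Technical replicates: x_ij = x_i + eps_j, with independent replication error
x_rep = np.repeat(x_subj, n_reps) + rng.normal(0, 1, n_subjects * n_reps)
y_rep = np.repeat(y_subj, n_reps) + rng.normal(0, 1, n_subjects * n_reps)
print("r treating all replicates as independent observations:",
      np.corrcoef(x_rep, y_rep)[0, 1])   # attenuated by the replication error
```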
FIGURE 8. (A) Scatter plot of n = 10 observations of two variables x and y (blue dots •) and of three replicates for each observation (red open circles ○). The sample correlation between the n = 10 observations is
Dependence between observations can also arise for reasons that are outside the control of the experimenter, as in the presence of correlated measurement noise, where the data can be modeled as

$$x_i = \tilde{x}_i + \phi_x \qquad \text{and} \qquad y_i = \tilde{y}_i + \phi_y$$

where $\tilde{x}_i$ and $\tilde{y}_i$ are the error-free values and φx and φy are correlated error terms, normally distributed with zero mean and a given error variance-covariance. The presence of correlated errors can induce correlation between two variables that are originally uncorrelated, as shown in Figure 8B. The effect of correlated and uncorrelated error on the Pearson correlation coefficient is discussed in Saccenti et al. (2020).
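A minimal Python illustration of this effect (assuming an arbitrary error correlation of 0.8, chosen only for the sketch) is the following:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200

# True signals: x and y are independent, so their true correlation is zero
x_true = rng.normal(40, 1, n)
y_true = rng.normal(40, 1, n)

# Correlated measurement errors phi_x, phi_y with zero mean
error_cov = np.array([[1.0, 0.8],
                      [0.8, 1.0]])
phi = rng.multivariate_normal(mean=[0.0, 0.0], cov=error_cov, size=n)

x_obs = x_true + phi[:, 0]
y_obs = y_true + phi[:, 1]

print("r between true signals:   ", np.corrcoef(x_true, y_true)[0, 1])  # ~0
print("r between observed values:", np.corrcoef(x_obs, y_obs)[0, 1])    # clearly positive
```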
Another common error is the estimation of the correlation on time series data pooled to increase the sample size, which is again a violation of the assumption of independent observations. This type of data also needs to be handled carefully, since it is very easy to obtain misleading results: a classic reference on this topic is Yule (1926).
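The time-series pitfall can be illustrated with two independent random walks, the classic "nonsense correlation" setting discussed by Yule (1926); the Python sketch below is an illustration of this setting only, not a reproduction of any analysis in this note.

```python
import numpy as np

rng = np.random.default_rng(5)
t = 500

# A single pair of independent random walks: the increments are unrelated
x = np.cumsum(rng.normal(size=t))
y = np.cumsum(rng.normal(size=t))
print("r for one pair of independent random walks:", np.corrcoef(x, y)[0, 1])

# Repeating the exercise many times shows that spuriously large correlations
# between independent random walks are common
r_values = [abs(np.corrcoef(np.cumsum(rng.normal(size=t)),
                            np.cumsum(rng.normal(size=t)))[0, 1])
            for _ in range(1000)]
print("median |r| over 1000 pairs:", np.median(r_values))
```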
3 Plot the data rather than blindly trusting correlation values
The examples discussed in the previous section should (hopefully) suggest that plotting the data is a critical step for the analysis, understanding and interpretation of correlations: visual exploration of scatter plots like those shown in Figures 2–4 can easily reveal the presence of outliers and of data structures that point to a violation of the assumption of sampling from one population or of the independence of the observations (although the latter can be tricky to spot). In the case of multivariate data, when p ≫ 2 variables are measured, plotting and visually exploring all possible correlation plots is usually not feasible, since the number of plots increases as p(p − 1)/2. In this case, projection methods like principal component analysis (Pearson, 1901; Hotelling, 1933) can be used to obtain a compact, low-dimensional view of the data and to reveal the presence of outliers or of groupings of the observations, as shown in Figure 9.
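In practice a quick principal component analysis of the data matrix is often enough to reveal such structure before any correlation is computed. The following Python sketch is a minimal illustration (PCA is done here with a plain SVD rather than a dedicated library, and the batch shift of 0.5 on half of the variables is an arbitrary value chosen for the sketch, larger than the 0.2 used for Figure 9).

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p = 200, 1000

# Simulated data: all variables have unit variance; the second half of the
# samples receive a hypothetical batch shift of 0.5 on half of the variables
data = rng.normal(40.0, 1.0, size=(n, p))
data[n // 2:, : p // 2] += 0.5

# PCA via SVD of the mean-centered data matrix
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s   # principal component scores (samples x components)

# The two batches separate along the first principal component, which a
# score plot (PC1 vs PC2) would show as two groups of samples
print("batch 1 mean PC1 score:", scores[: n // 2, 0].mean())
print("batch 2 mean PC1 score:", scores[n // 2:, 0].mean())
```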
FIGURE 9. Principal component analysis score plot of n = 200 observations of p = 1,000 variables. (A) Observations are sampled from the same multivariate normal distribution: variables are uncorrelated, with unit variance and mean 40. There is no structure in the data and no outliers are visible. (B) Observations sampled in the presence of a batch effect affecting 500 variables, which are sampled from a normal distribution with mean 40.2. The batch effect is clearly visible in the PCA score plot, suggesting a violation of the assumption of sampling from one distribution. Caution must be taken when correlations are calculated!
4 Conclusion
In this technical note, I have shown some of the consequences of neglecting the assumptions of sampling from one population and independence of observations when calculating the Pearson’s correlation coefficient. I have illustrated violations of these assumptions that originate when data sets coming from different experiments, pertaining to different experimental conditions, or affected by batch effects are merged before the calculation of the correlation coefficient. This way of proceeding results in the inflation or deflation of correlations: in both cases wrong inferences will be made, leading one to believe that a correlation exists when it does not, or that a correlation does not exist when it actually does. Similar problems arise when correlations are taken over repeated measures or time series.
The hope is that the reader, after having read and reflected on these examples, will be able to recognize those situations in which the calculation of the sample correlation is not justified because of the violation of fundamental statistical assumptions.
Data availability statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Author contributions
ES is the sole author, having conceived, written, and edited this manuscript.
Funding
This study has received funding from Netherlands Organisation for Health Research and Development (ZonMW) through the PERMIT project (Personalized Medicine in Infections: from Systems Biomedicine and Immunometabolism to Precision Diagnosis and Stratification Permitting Individualized Therapies, project number 456008002) under the PerMed Joint Transnational call JTC 2018 (Research projects on personalised medicine—smart combination of pre-clinical and clinical research with data and ICT solutions).
Acknowledgments
Robert van Vorstenbosch is acknowledged for fruitful comments on the manuscript.
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1The notation V(x), with the use of the superscript (x), may seem, at first, impractical. Its utility will become evident in the remainder of the paper, when more and different quantities are introduced.
References
Bakdash, J. Z., and Marusich, L. R. (2017). Repeated measures correlation. Front. Psychol. 8, 456. doi:10.3389/fpsyg.2017.00456
Calkins, D. S. (1974). Some effects of non-normal distribution shape on the magnitude of the Pearson product moment correlation coefficient. Rev. Interam. Psicol. 8, 261–288.
Chan, T. F., Golub, G. H., and LeVeque, R. J. (1982). “Updating formulae and a pairwise algorithm for computing sample variances,” in COMPSTAT 1982 5th Symposium held at Toulouse 1982 (Berlin: Springer), 30–41.
Havlicek, L. L., and Peterson, N. L. (1977). Effect of the violation of assumptions upon significance levels of the Pearson r. Psychol. Bull. 84, 373–377. doi:10.1037/0033-2909.84.2.373
Havlicek, L. L., and Peterson, N. L. (1976). Robustness of the Pearson correlation against violations of assumptions. Percept. Mot. Ski. 43, 1319–1334. doi:10.2466/pms.1976.43.3f.1319
Hayes, K. (2012). Updating formulae for the sample covariance and correlation. Teach. Statistics Int. J. Teach. 34, 65–67. doi:10.1111/j.1467-9639.2011.00491.x
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441. doi:10.1037/h0071325
Janse, R. J., Hoekstra, T., Jager, K. J., Zoccali, C., Tripepi, G., Dekker, F. W., et al. (2021). Conducting correlation analysis: Important limitations and pitfalls. Clin. Kidney J. 14, 2332–2337. doi:10.1093/ckj/sfab085
Li, B., Tsoi, L. C., Swindell, W. R., Gudjonsson, J. E., Tejasvi, T., Johnston, A., et al. (2014). Transcriptome analysis of psoriasis in a large case–control sample: RNA-seq provides insights into disease mechanisms. J. Investigative Dermatology 134, 1828–1838. doi:10.1038/jid.2014.28
Motulsky, H. (2014). Intuitive biostatistics: A nonmathematical guide to statistical thinking. USA: Oxford University Press.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin philosophical Mag. J. Sci. 2, 559–572. doi:10.1080/14786440109462720
Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 58, 240–242.
Saccenti, E., Hendriks, M. H., and Smilde, A. K. (2020). Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Sci. Rep. 10, 438–519. doi:10.1038/s41598-019-57247-4
Schober, P., Boer, C., and Schwarte, L. A. (2018). Correlation coefficients: Appropriate use and interpretation. Anesth. Analgesia 126, 1763–1768. doi:10.1213/ANE.0000000000002864
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. Am. J. Psychol. 18, 161–169. doi:10.2307/1412408
Spearman, C. (1904). Measurement of association, part ii. correction of ‘systematic deviations. Am. J. Psychol. 15, 88–101.
Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and nonnormality. Commun. Statistics-Simulation Comput. 38, 2220–2234. doi:10.1080/03610910903289151
Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between time-series? A study in sampling and the nature of time-series. J. R. Stat. Soc. 89, 1–63.
Keywords: covariance, data fusion, data merging, pearson’s correlation, repeated measures, spearman’s correlation, technical replicates, violation of statistical assumptions
Citation: Saccenti E (2023) What can go wrong when observations are not independently and identically distributed: A cautionary note on calculating correlations on combined data sets from different experiments or conditions. Front. Syst. Biol. 3:1042156. doi: 10.3389/fsysb.2023.1042156
Received: 12 September 2022; Accepted: 09 January 2023;
Published: 30 January 2023.
Edited by:
Kristin Tøndel, Norwegian University of Life Sciences, Norway

Copyright © 2023 Saccenti. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Edoardo Saccenti, edoardo.saccenti@wur.nl