Data driven three-dimensional temperature and salinity anomaly reconstruction of the northwest Pacific Ocean

Chen, Yuanhong; Liu, Li; Chen, Xueen; Wei, Zhiqiang; Sun, Xiang; Yuan, Chunxin; Gao, Zhen

doi:10.3389/fmars.2023.1121334

ORIGINAL RESEARCH article

Front. Mar. Sci., 04 May 2023

Sec. Physical Oceanography

Volume 10 - 2023 | https://doi.org/10.3389/fmars.2023.1121334


Li Liu²

Zhiqiang Wei³

Zhen Gao¹*

¹School of Mathematical Sciences, Ocean University of China, Qingdao, China
²Frontier Science Center for Deep Ocean Multispheres and Earth System (FDOMES) and Physical Oceanography Laboratory, Ocean University of China, Qingdao, China
³College of Information Science and Engineering, Ocean University of China, Qingdao, China

Owing to the rapid development of ocean observation technologies, tens of petabytes of data have been archived, the largest portion of which comes from orbiting satellites and characterizes the ocean surface. However, the lack of information below the surface restricts the use of these data and the understanding of ocean dynamics. To address this limitation, we present a spatially three-dimensional reconstruction of ocean hydrographic profiles at depth based on satellite and in situ measurements. In this manuscript, a long short-term memory network (LSTM) and Gaussian process regression (GPR) are used to predict the temperature and salinity profiles in the northwest Pacific Ocean, and, to improve computational and storage efficiency, the proper orthogonal decomposition (POD) method is incorporated into both models. LSTM and GPR show satisfactory results: the root mean square error (RMSE) of temperature is less than 1.45, and the RMSE of salinity is less than 0.19. Incorporating the POD method substantially improves efficiency, particularly for the LSTM model, which becomes about 7.5 times faster without significant loss of accuracy. A sensitivity analysis of the sea surface parameters reveals that sea surface height anomaly and latitude significantly influence the reconstruction of temperature anomaly (TA) and salinity anomaly (SA) profiles. In addition, sea surface salinity and sea surface temperature anomalies can improve the model's estimates of the upper TAs and SAs, respectively. The contribution of monthly climatology to temperature and salinity profile estimation is also explored, and it is shown that adding the monthly mean climatology to the model input yields more accurate estimates.

1 Introduction

The ocean is an integral part of the global climate system and plays a crucial role in regulating climate change and balancing the Earth’s energy (Su et al., 2018). Knowledge of the vertical distribution of ocean temperature and salinity is essential for exploring the complex dynamical processes and ecosystems within the ocean (Rao and Sivakumar, 2003; Wilson and Coles, 2005; de Boyer Montégut et al., 2007; Helber et al., 2010; Meehl et al., 2011; Qin et al., 2015). However, the vertical temperature and salinity data accumulated so far are far from sufficient, and the sparseness and discontinuity of the observations, caused by the limited number of observation points, have severely limited the study of ocean processes and mechanisms (Klemas and Yan, 2014; Liu, 2016). Although the rapid development of satellite remote sensing has made increasingly high-resolution, multi-scale and long-term continuous observations available, these data are limited to the ocean’s surface and cannot provide spatially and temporally continuous information on the subsurface structure of the ocean (Ali et al., 2004; Wu et al., 2012; Bao et al., 2018). Nevertheless, the sea surface state is closely related to the subsurface features: according to the laws of ocean dynamics, many deep dynamic processes generate signals at the sea surface that satellite sensors can capture (Fiedler, 1988). Therefore, various methods that reconstruct subsurface temperature and salinity by combining in situ observations with satellite remote sensing data have been developed over the years.

One typical approach to reconstructing the internal structure from sea surface information is based on dynamics. Assimilating observational data into numerical simulations (Ghil and Malanotte-Rizzoli, 1991; Troccoli and Haines, 1999; Vossepoel and Behringer, 2000; Carrassi et al., 2018; Moore et al., 2019) is a typical dynamical approach to inverting ocean subsurface information. However, it often requires substantial computing resources, and uncertainties in the initial and forcing fields mean that the estimation accuracy cannot be guaranteed (Robinson and Lermusiaux, 2000). Using a simplified dynamical framework can improve computational efficiency to some extent. Isern-Fontanet et al. (2006) proposed a method based on surface quasi-geostrophic (SQG) theory (Held et al., 1995) that derives subsurface density from sea surface density, sea surface height, and historical buoyancy frequency profiles. Further, Lapeyre and Klein (2006) developed an effective SQG (eSQG) method, assuming that the ocean interior’s potential vorticity (PV) is correlated with the surface density. eSQG was shown to be effective in improving subsurface flow field reconstructions (Isern-Fontanet et al., 2008; Qiu et al., 2016). Based on SQG, Wang et al. (Liu et al., 2017) proposed the interior + SQG (isQG) method, which superimposes the SQG mode with the barotropic and first baroclinic modes to achieve subsurface density reconstruction by solving the surface quasi-geostrophic equation and the interior equation. The method’s effectiveness has been verified in different studies (Liu et al., 2014; Liu et al., 2017). However, because the model is simplified, some complex dynamical processes in the ocean are neglected (Liu et al., 2019; Meng and Yan, 2022). Moreover, SQG-based methods directly invert only the density field, which introduces additional errors when the temperature and salinity fields are subsequently reconstructed (Chen et al., 2020).

Statistical methods are also widely used to reconstruct the three-dimensional structure of the ocean. In earlier studies, linear regression (Willis et al., 2003; Nardelli and Santoleri, 2004; Guinehut et al., 2012) and least squares regression (Carnes et al., 1990; Carnes et al., 1994) were used to estimate deep ocean information. In addition, methods based on empirical orthogonal functions (EOF) (Maes et al., 2000; Meinen and Watts, 2000; Buongiorno Nardelli and Santoleri, 2005; Yan et al., 2020) are widely used to reconstruct the subsurface vertical structure. These methods use EOFs to decompose the ocean vertical state vector, retain a few leading modes, and then use least-squares regression or a variational method (Fujii and Kamachi, 2003a; Fujii and Kamachi, 2003b) to solve for an objective function controlled by the sea surface information and the leading modes. With the development of artificial intelligence techniques and machine learning methods, an increasing number of studies are focusing on the potential of machine learning in three-dimensional temperature and salinity field reconstruction. These methods can effectively mine the intrinsic patterns in the data and estimate the structure of physical quantities in the ocean interior from sea surface parameters. Currently, self-organizing maps (Wu et al., 2012; Chen et al., 2018), support vector machine regression (Su et al., 2015; Li et al., 2017), random forest (Su et al., 2018), and neural network-based methods (Ali et al., 2004; Ballabrera-Poy et al., 2009; Bao et al., 2018; Lu et al., 2019; Buongiorno Nardelli, 2020; Su et al., 2020; Su et al., 2021) have been applied to estimate three-dimensional thermohaline fields. The results show that machine learning methods can achieve better reconstructions given a large amount of observation data and have strong generalizability.

In this context, this paper applies several different regression methods to estimate the subsurface temperature anomaly (TA) and subsurface salinity anomaly (SA). The first method is Gaussian process regression (GPR) (Williams and Rasmussen, 1995; Rasmussen and Williams, 2005), an effective tool widely used in complex real-world problems (Stein, 1999; Forrester et al., 2008; Nguyen and Peraire, 2016). GPR is flexible enough to obtain estimates of unknown quantities using simple matrix operations and often achieves reliable accuracy on small data sets. More importantly, it can quantify the uncertainty of a prediction because it gives the distribution of the predicted values (Rasmussen and Williams, 2005). The second method is the long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997), a deep learning algorithm that can learn long-term dependencies (Sak et al., 2014; Wan et al., 2018; Yeo and Melnyk, 2019). However, training an LSTM is often computationally expensive (Masuko, 2017), which is especially evident in ocean applications with “big data” characteristics. Therefore, to reduce the training time and save storage costs, we further propose LSTM-POD and GPR-POD to predict the vertical distribution of TA and SA by introducing POD (Liang et al., 2002), a widely used tool for reduced order modeling (Lucia et al., 2004; Quarteroni et al., 2015). Specifically, POD simplifies the dataset and reduces its dimensionality by identifying the few main modes that represent the three-dimensional temperature and salinity fields with sufficient precision. Because the TA and SA profiles can then be approximated as linear combinations of these modes, we only need to learn the relationship between the coefficients and the input parameters with LSTM and GPR, which greatly simplifies the regression model.
Therefore, the first goal of this paper is to explore the reconstruction accuracy of the different models and the extent to which POD improves computational efficiency. The reliability and computational efficiency of the proposed methods are verified by calculating the root mean square error (RMSE) between the estimated TA and SA profiles and the Argo thermohaline anomaly profiles, as well as by comparing the CPU time of the different methods. In addition, we point out that it is only necessary to interpolate the modes with respect to depth: the linear combination of the interpolated modes, weighted by the predicted coefficients, yields TA and SA estimates at any depth.

In particular, the effects of different parameters on TA and SA estimates are also investigated by comparing the prediction accuracy of models with various input parameters and calculating the correlation coefficients between input parameters and the temperature and salinity anomalies. In addition, this paper evaluates the potential application of climatology in temperature and salinity reconstruction using different combinations of temperature and salinity climatology data.

This paper is organized as follows. The study area and the data used are presented in Section 2. Section 3 gives an overview of the methodology. Section 4 is devoted to the description of the results, and conclusions are drawn in Section 5.

2 Data

The goal of this paper is to reconstruct the three-dimensional thermohaline structure of the Northwest Pacific Ocean (95° W-135° W, 5° S-45° N) for the period 2011-2021. The ocean observations used in this study are Argo data, satellite sea surface temperature (SST), satellite sea surface salinity (SSS), and sea surface height anomaly (SSHA), which are described below. The World Ocean Atlas 2018 (WOA18) climatology, used as the monthly mean climatology, is also described.

2.1 Argo data

The Argo profiles are obtained from the Global Sea Ocean Argo Scatter Dataset (V3.0) (Liu et al., 2021) provided by the China Argo Real-time Data Center (ftp://ftp.argo.org.cn/pub/ARGO/global/). This dataset collects more than 2.3 million temperature and salinity depth profiles observed by over 15,000 automated profiling buoys deployed in the global ocean by international Argo member countries from July 2000 through June 2020. Data inside the region of 95° W-135° W and 5° S-45° N for the period 2011-2021 are considered here. Profiles with depths exceeding 700 m are selected and interpolated with a spline onto the same 71 vertical levels, extending from the surface (5 m) to 705 m with a vertical step of 10 m.
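The regridding step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a cubic spline from SciPy (the text only says "a spline"), rejects profiles that do not span the full 5 m to 705 m range so that nothing is extrapolated, and uses a synthetic profile rather than real Argo data. The function name `regrid_profile` is hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Target grid from the text: 5 m to 705 m in 10 m steps, i.e. 71 levels.
TARGET_DEPTHS = np.arange(5.0, 706.0, 10.0)

def regrid_profile(depth, value, target=TARGET_DEPTHS):
    """Interpolate one profile onto the target depths with a cubic spline.

    Returns None when the profile does not cover the target range
    (assumption: no extrapolation). Depths must be distinct.
    """
    depth = np.asarray(depth, dtype=float)
    value = np.asarray(value, dtype=float)
    order = np.argsort(depth)              # CubicSpline needs increasing x
    depth, value = depth[order], value[order]
    if depth[0] > target[0] or depth[-1] < target[-1]:
        return None
    return CubicSpline(depth, value)(target)

# Synthetic (hypothetical) profile: temperature decreasing linearly with depth.
obs_depth = np.array([2.0, 10.0, 25.0, 50.0, 100.0, 200.0, 300.0, 500.0, 750.0])
obs_temp = 28.0 - 0.02 * obs_depth
temp_71 = regrid_profile(obs_depth, obs_temp)
```

A linear profile is used only because the spline reproduces it exactly, which makes the result easy to check by hand.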

2.2 Satellite SST data

The SST data used in this study are produced by the OSTIA (Operational SST and Ice Analysis) system, using re-processed ESA SST CCI, C3S EUMETSAT and REMSS satellite data and in situ data from the HadIOD dataset, and are distributed through the Copernicus Marine Environment Monitoring Service (CMEMS, http://marine.copernicus.eu/services-portfolio/access-to-products/, product_id = SST_GLO_SST_L4_REP_OBSERVATIONS_010_011). This product provides daily maps of the SST and SST uncertainty on a global regular grid at 0.05° resolution, stored in netCDF format following the Group for High Resolution SST specification.

2.3 Satellite SSS data

The SSS data are a Level 4 product on a 0.25° spatial and 4-day temporal grid, produced by the International Pacific Research Center (IPRC) at the University of Hawaii at Manoa in collaboration with Remote Sensing Systems (RSS) in Santa Rosa, California, using observations from NASA’s Aquarius/SAC-D and Soil Moisture Active Passive (SMAP) satellite missions. The product is a continuous, consistent multi-satellite SSS dataset obtained by optimal interpolation with a 7-day decorrelation time scale (Melnichenko et al., 2016). Its mean root mean squared difference from concurrent in situ data over the globe is about 0.19 psu, and the product bias is about zero.

2.4 Satellite SSHA data

The altimeter sea level anomalies with daily and 0.25°×0.25° resolutions are provided by Sea Level TAC (Thematic Assembly Centre, https://resources.marine.copernicus.eu/product-detail/SEALEVEL_GLO_PHY_L4_MY_008_047/DATA-ACCESS). The data produced in the frame of this TAC are generated by the processing system including data from all altimeter Copernicus missions (Sentinel-6A, Sentinel-3A/B) and other collaborative or opportunity missions (e.g.: Jason-3, Saral[-DP]/AltiKa, Cryosat-2, OSTM/Jason-2, Jason-1, Topex/Poseidon, Envisat, GFO, ERS-1/2, Haiyang-2A/B/C).

2.5 Climatology data

The climatology data used in this study are from the World Ocean Atlas 2018 (Boyer et al., 2018) (https://www.ncei.noaa.gov/products/world-ocean-atlas), which is provided by the National Oceanographic Data Center (now the National Centers for Environmental Information, NCEI). The atlas is an objectively analyzed, quality-controlled collection of temperature, salinity, oxygen, phosphate, silicate, and nitrate averages based on profile data from the World Ocean Database and distributed online by NCEI. Monthly climatology fields of temperature and salinity at standard depth levels at a spatial resolution of 0.25°×0.25° were used in this study and interpolated by cubic spline onto a regularly spaced vertical grid (10 m apart).

Note that, in the present study, all satellite data and climatology data are interpolated to the locations of the Argo observations by bilinear interpolation. Anomalies are defined as the observations (SST, SSS, Argo) minus the corresponding monthly WOA18 values.
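The co-location and anomaly definition above can be illustrated with a short sketch. It is written under stated assumptions: the grid, the field values and the function name `bilinear` are hypothetical, the grid is regular and strictly increasing, and no land masking or missing-value handling is attempted.

```python
import numpy as np

def bilinear(lons, lats, field, lon, lat):
    """Bilinearly interpolate field[lat_index, lon_index] to the point (lon, lat).

    lons and lats are 1-D, strictly increasing grid coordinates.
    """
    i = int(np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2))
    j = int(np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2))
    tx = (lon - lons[i]) / (lons[i + 1] - lons[i])
    ty = (lat - lats[j]) / (lats[j + 1] - lats[j])
    return ((1 - tx) * (1 - ty) * field[j, i]
            + tx * (1 - ty) * field[j, i + 1]
            + (1 - tx) * ty * field[j + 1, i]
            + tx * ty * field[j + 1, i + 1])

# Hypothetical 0.25-degree grid and synthetic fields (not real SST or WOA18 data).
lons = np.linspace(0.0, 2.0, 9)
lats = np.linspace(10.0, 12.0, 9)
LON, LAT = np.meshgrid(lons, lats)         # arrays of shape (n_lat, n_lon)
sst_grid = 2.0 * LON + 3.0 * LAT           # stand-in for a satellite SST map
clim_grid = sst_grid - 0.5                 # stand-in for the monthly climatology

# Interpolate both fields to a float position, then subtract: anomaly = obs - clim.
float_lon, float_lat = 0.6, 10.9
sst_at_float = bilinear(lons, lats, sst_grid, float_lon, float_lat)
ssta = sst_at_float - bilinear(lons, lats, clim_grid, float_lon, float_lat)
```

Because the synthetic field is linear in longitude and latitude, the interpolated value can be checked directly against 2·lon + 3·lat.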

3 Method

3.1 Gaussian process regression

In this section, GPR is used to estimate the vertical profiles of TA and SA. The SAs and TAs at each level are treated as a collection of random variables that obey a joint Gaussian distribution, modeled as $y = f(x) + ϵ$, where $x$ is the input vector of predictors, namely sea surface temperature anomaly (SSTA), sea surface salinity anomaly (SSSA), sea surface height anomaly (SSHA), longitude (LON), latitude (LAT) and the day of the year projected on a circle (JULD). The prior distribution of $f(x)$ is assumed to be a Gaussian process (GP), $f(x) \sim \mathcal{GP}(0, κ(x, x'))$, where $κ$ is a positive semi-definite kernel, and $ϵ \sim N(0, χ^{2})$ denotes the Gaussian noise term with standard deviation $χ$. Based on the historical Argo profiles and remote sensing data, we collect $n$ TA or SA observations at depth $z$ to form the observation set $y = \{y_{1}, y_{2}, \dots, y_{n}\}$. Corresponding to each observation, we also collect a set of input parameters $X_{g} = [x_{1} \mid x_{2} \mid \dots \mid x_{n}] \in ℝ^{d \times n}$, where $x_{i} = (\mathrm{SSTA}_{i}, \mathrm{SSSA}_{i}, \mathrm{SSHA}_{i}, \mathrm{LON}_{i}, \mathrm{LAT}_{i}, \mathrm{JULD}_{i})^{\top}$. Then the prior distribution of $y$ can be given as:

$$ y \mid X_{g} \sim N(0, K_{y}), \quad K_{y} = \mathrm{Cov}[y \mid X_{g}] = κ(X_{g}, X_{g}) + χ^{2} I_{n}. \tag{1} $$

GPR estimates the posterior distribution of an unknown quantity under a Gaussian process prior and a Gaussian likelihood. Specifically, it follows from the above assumptions that the estimated value $f_{*}$ at a new input parameter $x_{*}$ and the existing observations $y$ are jointly normally distributed:

$$ \begin{bmatrix} y \\ f_{*} \end{bmatrix} \sim N\left(0, \begin{bmatrix} K_{y} & K_{*} \\ K_{*}^{T} & K_{**} \end{bmatrix}\right), \tag{2} $$

where $K_{*} = κ(X_{g}, x_{*})$ and $K_{**} = κ(x_{*}, x_{*})$. Combining the prior assumption and the likelihood, the posterior distribution of $f_{*}$ is obtained by conditioning this joint normal distribution on the observations $y$ (Williams and Rasmussen, 1995),

$$ f_{*} \mid x_{*}, X_{g}, y \sim N(m^{*}, C^{*}), \quad m^{*} = K_{*}^{T} K_{y}^{-1} y, \quad C^{*} = K_{**} - K_{*}^{T} K_{y}^{-1} K_{*}. \tag{3} $$

The reconstruction of TA and SA profiles using GPR is carried out with the Python library GPy (GPy, 2012), and the output is a vector formed by concatenating the TA profile and the SA profile at the same location. The kernel function chosen in this paper is the radial basis function. Before the model is trained, the inputs and outputs are scaled to [0,1] using their minimum and maximum values to remove the effect of differing magnitudes.

In summary, the derivation of the vertical structure of TA and SA using GPR involves two processes: online and offline stages. In the offline stage, historical observations are collected to build the corresponding input/output training set, from which the GPR is trained to learn the mapping of the input parameters to the TA and SA profiles. In the online stage, the corresponding TA and SA profiles are estimated from the already trained GPR for the new input parameters. The flowchart of TA and SA estimation using the GPR is shown in Figure 1.

FIGURE 1

Figure 1 Flowchart of salinity and temperature estimation using GPR.
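The offline and online stages of Section 3.1 can be sketched directly from Eqs. (1)-(3). The snippet below is a NumPy illustration and not the GPy-based implementation used in the paper: the kernel hyperparameters are fixed by hand rather than optimized, the training data are synthetic, only the inputs are scaled, and the names `minmax_scale`, `rbf_kernel` and `gpr_posterior` are hypothetical.

```python
import numpy as np

def minmax_scale(a, lo, hi):
    """Scale each row of a to [0, 1] using the training minimum and maximum."""
    return (a - lo) / (hi - lo)

def rbf_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Radial basis function kernel between the columns of A (d x n) and B (d x p)."""
    sq = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gpr_posterior(X_g, y, x_star, noise_std=0.1, **kern):
    """Posterior mean m* and covariance C* of Eq. (3).

    X_g: d x n training inputs, y: n x q training outputs, x_star: d x p new inputs.
    """
    n = X_g.shape[1]
    K_y = rbf_kernel(X_g, X_g, **kern) + noise_std**2 * np.eye(n)   # Eq. (1)
    K_s = rbf_kernel(X_g, x_star, **kern)
    K_ss = rbf_kernel(x_star, x_star, **kern)
    L = np.linalg.cholesky(K_y)                    # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y
    V = np.linalg.solve(L, K_s)
    return K_s.T @ alpha, K_ss - V.T @ V

# Offline stage: synthetic training set with d = 6 inputs (assumed) and 2 outputs.
rng = np.random.default_rng(0)
d, n, p = 6, 60, 5
X_raw = rng.uniform(-3.0, 3.0, size=(d, n))
lo, hi = X_raw.min(axis=1, keepdims=True), X_raw.max(axis=1, keepdims=True)
X_g = minmax_scale(X_raw, lo, hi)
y = np.stack([np.sin(3.0 * X_g[0]), X_g[1] ** 2], axis=1)

# Online stage: predict at new inputs and report the posterior standard deviation.
x_star = minmax_scale(rng.uniform(-3.0, 3.0, size=(d, p)), lo, hi)
mean, cov = gpr_posterior(X_g, y, x_star, noise_std=0.1, lengthscale=0.5)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The Cholesky factorization is used instead of an explicit inverse of $K_{y}$ for numerical stability; the result is the same as in Eq. (3). The posterior standard deviation `std` is the kind of quantity mapped in an uncertainty plot.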

3.2 Long short-term memory network

LSTM is an extension of the traditional recurrent neural network (RNN), designed mainly to handle cases in which traditional RNNs fail. It alleviates, to a certain extent, the vanishing and exploding gradient problems of traditional RNNs and can therefore learn long-term dependencies. Like a general RNN, it consists of a series of repeating cells connected sequentially, but the structure of each cell is more complex. Crucially, the LSTM adds to each cell a state vector that is updated over time and can selectively record information and preserve the long-term state of the system. Each cell has three inputs: the network input $x_{i}$ at the current time, the output $h_{i-1}$ at the previous time, and the cell state $C_{i-1}$ at the previous time. The inputs pass through three gates with different functions, namely the forget gate, the input gate and the output gate, which remove information from and add information to the cell state $C_{i}$ at the current time and update the output state $h_{i}$ at the current time. The structure of each cell is shown in Figure 2 (Hochreiter and Schmidhuber, 1997), and the gates are implemented using the following equations (Hochreiter and Schmidhuber, 1996; Gers et al., 2000):

FIGURE 2

Figure 2 Structure of single cell.

$$ \begin{aligned} f_{i} &= σ(W_{f}[h_{i-1}, x_{i}] + b_{f}), & I_{i} &= σ(W_{I}[h_{i-1}, x_{i}] + b_{I}), \\ \tilde{C}_{i} &= \tanh(W_{C}[h_{i-1}, x_{i}] + b_{C}), & O_{i} &= σ(W_{O}[h_{i-1}, x_{i}] + b_{O}), \\ C_{i} &= f_{i} * C_{i-1} + I_{i} * \tilde{C}_{i}, & h_{i} &= O_{i} * \tanh(C_{i}), \end{aligned} \tag{4} $$

where $σ$ denotes the sigmoid activation function, $[h_{i-1}, x_{i}]$ denotes the concatenation of the two vectors, $W_{f}, W_{I}, W_{C}, W_{O}$ are weight matrices, $b_{f}, b_{I}, b_{C}, b_{O}$ are model biases, and $*$ represents the element-wise (Hadamard) product.

Following Buongiorno Nardelli (2020), we also use LSTM to estimate the TA and SA profiles. The TAs and SAs at different depths are treated as the output states of different cells. The input to each cell is the same: a multivariate vector consisting of SSTA, SSSA, SSHA, LON, LAT, and JULD at the current position. The structure of the LSTM is the same as that in Buongiorno Nardelli (2020), i.e., a 2-layer stacked network in which each layer contains 35 hidden units, and the optimization algorithm is Adam (Kingma and Ba, 2014). As with GPR, we use min-max normalization to preprocess the data before training the LSTM.
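To make Eq. (4) concrete, the sketch below applies one LSTM cell repeatedly along the depth axis, feeding the same surface vector at every step as described above. It is only an illustration under several assumptions: a single layer with random, untrained weights is shown instead of the trained 2-layer network, no output layer mapping the hidden state to TA and SA is included, and the input size of 6 assumes JULD enters as a single number. The names `lstm_step`, `W` and `b` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, C_prev, W, b):
    """One application of Eq. (4); W[g] has shape (hidden, hidden + n_in)."""
    z = np.concatenate([h_prev, x])            # [h_{i-1}, x_i]
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    I = sigmoid(W["I"] @ z + b["I"])           # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate cell state
    O = sigmoid(W["O"] @ z + b["O"])           # output gate
    C = f * C_prev + I * C_tilde               # element-wise products
    h = O * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
n_in, hidden, n_levels = 6, 35, 71
W = {g: 0.1 * rng.standard_normal((hidden, hidden + n_in)) for g in "fICO"}
b = {g: np.zeros(hidden) for g in "fICO"}

x = rng.random(n_in)                           # scaled surface inputs (synthetic)
h, C = np.zeros(hidden), np.zeros(hidden)
states = []
for _ in range(n_levels):                      # one cell per vertical level
    h, C = lstm_step(x, h, C, W, b)
    states.append(h)
states = np.stack(states)                      # shape (71, 35)
```

In a trained model, each row of `states` would be passed through a further layer to give the TA and SA at the corresponding depth.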

3.3 Proper orthogonal decomposition

POD, which has a wide range of applications in various fields (Liang et al., 2002; Pinnau, 2008; Chapelle et al., 2012; Singler, 2014), provides a means of obtaining a low-dimensional description of a system. The method extracts from historical TA and SA profiles a small number of modes that represent the main features of the field, which simplifies the data and reduces their dimensionality. The temperature and salinity anomaly profiles at the same location are stacked into one column of a multivariate matrix $X$:

$$ X = \begin{bmatrix} T_{11} & T_{12} & \dots & T_{1n} \\ T_{21} & T_{22} & \dots & T_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ T_{m1} & T_{m2} & \dots & T_{mn} \\ S_{11} & S_{12} & \dots & S_{1n} \\ S_{21} & S_{22} & \dots & S_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ S_{m1} & S_{m2} & \dots & S_{mn} \end{bmatrix} \triangleq [u_{1} \mid u_{2} \mid \dots \mid u_{n}] \in ℝ^{2m \times n}, $$

where $m$ is the number of vertical levels, $n$ is the number of vertical profiles. In order to calculate the $k$ modes, we only need to do the singular value decomposition of $X$ as follows:

$$ X = U Σ Z^{T}, $$

where $U \in ℝ^{2m \times r}$ and $Z \in ℝ^{n \times r}$ have orthonormal columns. The matrix $Σ = \mathrm{diag}(σ_{1}, σ_{2}, \dots, σ_{r})$ contains the singular values $σ_{1} \geq σ_{2} \geq \dots \geq σ_{r}$, where $r$ is the rank of $X$. The first $k$ columns of $U$ are the $k$ modes to be computed, which we denote as

$$ U_{k} = U[:, 1:k] = \begin{bmatrix} L_{1} & \dots & L_{k} \\ M_{1} & \dots & M_{k} \end{bmatrix}. $$

Both $L_{i} = [L_{1i}, L_{2i}, \dots, L_{mi}]^{⊤}$ and $M_{i} = [M_{1i}, M_{2i}, \dots, M_{mi}]^{⊤}$ have dimension $m \times 1$. The corresponding TA profile can then be represented by a linear combination of $L_{i}$ $(i = 1, 2, \dots, k)$, and the SA profile by a linear combination of $M_{i}$ $(i = 1, 2, \dots, k)$, where the unknown coefficients are shared. A common method for determining the coefficients is to solve the linear system obtained by equating the sea surface elements of the reconstructed anomaly profiles to the sea surface observations. In this case, $k = 2$ is required. To increase $k$, other physical quantities of the ocean need to be added to $X$ to impose further constraints on the combination coefficients.
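The "common method" mentioned above reduces, for $k = 2$, to a 2×2 linear system: the surface rows of the two modes multiply the shared coefficients and must equal the observed surface anomalies. The sketch below illustrates this under stated assumptions: the first vertical level is taken as the sea surface, the modes are synthetic smooth functions rather than POD modes of real data, and the function name `surface_coefficients` is hypothetical.

```python
import numpy as np

def surface_coefficients(L, M, ssta, sssa):
    """Solve for the two shared coefficients from the surface constraints.

    L and M are m x 2 arrays holding the temperature and salinity parts of
    the two modes; row 0 is treated as the sea surface level.
    """
    A = np.array([[L[0, 0], L[0, 1]],
                  [M[0, 0], M[0, 1]]])
    return np.linalg.solve(A, np.array([ssta, sssa]))

# Synthetic modes on the 71 levels (illustrative only).
z = np.arange(5.0, 706.0, 10.0)
L = np.stack([np.exp(-z / 200.0), np.cos(z / 100.0)], axis=1)
M = np.stack([0.1 * np.exp(-z / 100.0), -0.05 * np.sin(z / 150.0 + 1.0)], axis=1)

# Build a profile from known coefficients, then recover them from the surface values.
alpha_true = np.array([1.5, -0.7])
ta_profile, sa_profile = L @ alpha_true, M @ alpha_true
alpha = surface_coefficients(L, M, ta_profile[0], sa_profile[0])
ta_reconstructed, sa_reconstructed = L @ alpha, M @ alpha
```

With only two surface constraints, no more than two coefficients can be determined this way, which is why the text notes that $k = 2$ is required.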

Unlike the above approach, in this paper two methods, GPR and LSTM, are used to learn the relationship between the sea surface parameters and the coefficients. In particular, $U_{k}$ is the solution of the following minimization problem:

$$ \min_{W \in Y_{k}} \| X - W W^{T} X \|_{F}^{2}, \tag{5} $$

where $Y_{k} = \{W \in ℝ^{2m \times k} : W^{T} W = I_{k}\}$, $\| \cdot \|_{F}$ is the Frobenius norm, and the error is estimated as

$$ \| X - U_{k} U_{k}^{T} X \|_{F}^{2} = \sum_{i = k + 1}^{r} σ_{i}^{2}. \tag{6} $$

From this, we can compute the coefficient vector $α_{i} = U_{k}^{T} u_{i} \triangleq [α_{i}^{1}, α_{i}^{2}, \dots, α_{i}^{k}]^{⊤}$ corresponding to each historical temperature and salinity anomaly profile $u_{i}$. Moreover, $k$ can be determined from the required accuracy using the following criterion:

$$ \frac{\sum_{i = 1}^{k} σ_{i}^{2}}{\sum_{i = 1}^{r} σ_{i}^{2}} \geq 1 - δ, \tag{7} $$

where $δ$ is a user-defined cut-off threshold. To retrieve the vertical structure of TA and SA from the sea surface parameters, we collect the input parameters $x_{i}$ corresponding to the obtained state vectors $u_{i}$ $(i = 1, 2, \dots, n)$. Here $x_{i}$ is still a vector consisting of SSTA, SSSA, SSHA, LON, LAT and JULD. Then, based on the collected dataset $D_{tr} = \{(x_{1}, α_{1}), (x_{2}, α_{2}), \dots, (x_{n}, α_{n})\}$, we train GPR or LSTM to build the mapping from the input parameters $x$ to the coefficients $α$, and the TA and SA are reconstructed as (Hesthaven et al., 2016; Guo and Hesthaven, 2019)

$$ T(x) = \sum_{i = 1}^{k} α^{i}(x) L_{i}, \quad S(x) = \sum_{i = 1}^{k} α^{i}(x) M_{i}. $$

The corresponding flow chart is given in Figure 3.

FIGURE 3

Figure 3 Flowchart of salinity and temperature estimation using GPR-POD and LSTM-POD.
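The POD-based workflow of this section (SVD, choice of $k$ by Eq. (7), projection onto the modes, regression of the coefficients, and reconstruction) can be sketched end to end. The sketch makes several assumptions that should be kept in mind: the snapshot matrix is synthetic and exactly of rank 3, ordinary least squares is used as a simple stand-in for the GPR or LSTM regressor, and the names `choose_k` and `reconstruct` are hypothetical.

```python
import numpy as np

def choose_k(s, delta):
    """Smallest k whose retained energy fraction is at least 1 - delta, Eq. (7)."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    return min(int(np.searchsorted(energy, 1.0 - delta)) + 1, len(s))

rng = np.random.default_rng(2)
m, n, d = 71, 300, 6
X_in = rng.random((d, n))                          # scaled inputs, one column per profile
basis = np.linalg.qr(rng.standard_normal((2 * m, 3)))[0]
X = basis @ (rng.standard_normal((3, d)) @ X_in)   # synthetic 2m x n snapshot matrix

# POD modes from the singular value decomposition X = U Sigma Z^T.
U, s, _ = np.linalg.svd(X, full_matrices=False)
k = choose_k(s, delta=1e-6)
U_k = U[:, :k]
L_modes, M_modes = U_k[:m], U_k[m:]                # temperature and salinity parts
alpha = U_k.T @ X                                  # k x n coefficients

# Stand-in regressor (NOT GPR or LSTM): least squares from inputs to coefficients.
A = np.vstack([X_in, np.ones((1, n))]).T           # n x (d + 1), with a bias column
coef, *_ = np.linalg.lstsq(A, alpha.T, rcond=None)

def reconstruct(x_new):
    """Predict the coefficients for x_new and combine the modes into T and S."""
    a = np.append(x_new, 1.0) @ coef
    return L_modes @ a, M_modes @ a

T_hat, S_hat = reconstruct(X_in[:, 0])

# Numerical check of Eq. (6) with two retained modes.
U_2 = U[:, :2]
err_2 = np.linalg.norm(X - U_2 @ (U_2.T @ X), "fro") ** 2
tail_2 = np.sum(s[2:] ** 2)
```

Only $k$ coefficients per profile are regressed instead of $2m$ output values, which is where the savings in training time and storage come from.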

Further, we point out that $L_{i}$ and $M_{i}$ are in fact functions of depth $z$. To estimate the TA and SA at any depth $z^{*}$, we interpolate $L_{i}$ and $M_{i}$ with respect to depth using cubic splines to obtain the interpolation functions $\tilde{L}_{i}(z)$ and $\tilde{M}_{i}(z)$ satisfying $\tilde{L}_{i}(z_{j}) = L_{ji}$ and $\tilde{M}_{i}(z_{j}) = M_{ji}$ $(i = 1, 2, \dots, k,\ j = 1, 2, \dots, m)$, where $z_{j}$ $(j = 1, 2, \dots, m)$ denotes the vertical levels, so that the TA and SA at depth $z^{*}$ can be reconstructed as

$$ T(z^{*}, x) = \sum_{i = 1}^{k} α^{i}(x) \tilde{L}_{i}(z^{*}), \quad S(z^{*}, x) = \sum_{i = 1}^{k} α^{i}(x) \tilde{M}_{i}(z^{*}). $$
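The depth interpolation of the modes can be illustrated as follows. This is a sketch under assumptions: the modes are synthetic smooth functions rather than POD modes of real data, the coefficient vector `alpha` stands in for the output of a trained regressor, SciPy's `CubicSpline` with its default end conditions is used, and the function name `reconstruct_at` is hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

z = np.arange(5.0, 706.0, 10.0)                    # the 71 vertical levels z_j
# Synthetic modes, one column per mode (m x k with k = 3).
L = np.stack([np.exp(-z / 150.0), np.cos(z / 120.0), np.sin(z / 90.0)], axis=1)
M = np.stack([0.1 * np.sin(z / 90.0), 0.1 * np.cos(z / 120.0),
              0.1 * np.exp(-z / 150.0)], axis=1)
alpha = np.array([0.8, -0.3, 0.1])                 # assumed predicted coefficients

# One spline per mode, built along the depth axis; they pass through L_ji and M_ji.
L_tilde = CubicSpline(z, L, axis=0)
M_tilde = CubicSpline(z, M, axis=0)

def reconstruct_at(z_star):
    """TA and SA at arbitrary depths z_star from the interpolated modes."""
    z_star = np.atleast_1d(z_star)
    return L_tilde(z_star) @ alpha, M_tilde(z_star) @ alpha

ta_100, sa_100 = reconstruct_at(100.0)             # a depth between two grid levels
ta_grid, sa_grid = reconstruct_at(z)               # reproduces the gridded profiles
```

Only the $k$ modes are interpolated, once, after which a profile can be evaluated at any depth within the 5 m to 705 m range without retraining the regression model.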

4 Results

4.1 Comparison between different models

To evaluate the performance of the different models, we randomly selected 20% of the 11,374 Argo profiles as the test set and used the remaining 80% as the training set. The models are then employed to retrieve TA and SA profiles from the same inputs, i.e., SSTA, SSSA, SSHA, LON, LAT, and JULD. The RMSE between the TA and SA profiles obtained from these models and the Argo profiles is calculated to evaluate the performance of each method. The RMSE of the WOA is obtained by comparing the temperature and salinity profiles interpolated from the WOA18 monthly mean climatology with the Argo profiles. We set δ=0.04% and choose k=14 modes; as for the regression models, the data are scaled to between 0 and 1 before the POD is performed. Figure 4 shows the vertical distribution of the RMSEs estimated by the different models. The RMSEs of all proposed methods are smaller than the RMSE of the WOA18, and a similar vertical structure can be seen in the different models. The RMSEs are small at the sea surface, increase rapidly with depth, decrease rapidly after reaching the maximum, and then stabilize. This may be related to the complex dynamical processes in the ocean’s upper layers and the perturbations in the mixed layer and thermocline, whereas the seawater is relatively stable in the deeper layers. The RMSEs of temperature reach their maximum at 105 - 115 m, and only near this depth (where the temperature variation is relatively large) does LSTM show superior performance in predicting TA profiles compared to the other methods. This suggests that LSTM may have greater potential for approximating strongly nonlinear functions since, at depths where temperature changes rapidly, the relationship between temperature and depth is more complex and may be more strongly nonlinear.

FIGURE 4

Figure 4 Estimated RMSEs of different models. (A) RMSEs of estimated temperature. (B) RMSEs of estimated salinity.
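The evaluation protocol behind Figure 4 (a random 80/20 split of the 11,374 profiles and an RMSE computed at each vertical level) can be sketched as below. The random seed, the rounding of the test-set size and the function names `split_indices` and `rmse_by_depth` are assumptions made for illustration; the arrays are synthetic.

```python
import numpy as np

def split_indices(n_profiles, test_fraction=0.2, seed=0):
    """Random split into training and test indices (the seed is an assumption)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_profiles)
    n_test = int(round(test_fraction * n_profiles))
    return perm[n_test:], perm[:n_test]

def rmse_by_depth(estimated, observed):
    """RMSE at each level; both arrays have shape (n_profiles, n_levels)."""
    return np.sqrt(np.mean((estimated - observed) ** 2, axis=0))

train_idx, test_idx = split_indices(11374)

# Synthetic check: estimates that are off by a constant 0.5 at every level.
observed = np.random.default_rng(1).standard_normal((len(test_idx), 71))
estimated = observed + 0.5
rmse_profile = rmse_by_depth(estimated, observed)   # one value per depth level
```

With this rounding, the split gives 9,099 training profiles and 2,275 test profiles, which agrees with the training-set size quoted later in this section.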

However, the performance of the four methods is comparable overall. The RMSEs of salinity reach their maximum between 55 and 65 m. Although the four reconstruction methods have comparable accuracy, GPR performs slightly better than LSTM in reconstructing the upper-layer salinity. This may be because the salinity profile varies less rapidly than the temperature profile, so GPR can produce sufficiently accurate approximations, whereas LSTM has a more complex structure and numerous parameters, making it reliant on large amounts of training data to obtain more accurate predictions. In addition, combining POD with the regression methods does not cause a large loss in the accuracy of the estimated profiles, which is most clearly reflected in the almost identical RMSEs of GPR and GPR-POD; the selected modes are sufficient to characterize the vast majority of the temperature and salinity fields. To further quantify the uncertainty of the estimates of the GPR-based methods, Figure 5 shows the standard deviation of the posterior distribution of the GPR and GPR-POD predictions. Both methods display similar distribution patterns, with comparable uncertainty estimates at most points, but the uncertainty of the model prediction is higher at the western boundary of the region, which can be attributed to the limited training data available there. This also reveals a drawback of the GPR-based approach: it is better suited to interpolation than to extrapolation. Notably, the predictions of GPR-POD exhibit higher levels of uncertainty, despite the comparable RMSEs of GPR-POD and GPR. This discrepancy may be attributed to the fact that more information is available in the training data for GPR, making it more confident in its predictions.

FIGURE 5

Figure 5 Uncertainty estimation of (A) GPR and (B) GPR-POD.

Considering the inhomogeneity and sparsity of the spatio-temporal distribution of the thermohaline data, all the thermohaline relationships in the test set are drawn in the same graph (Figure 6) in order to compare the predictions of the different models; the points are colored differently only to make them easier to distinguish. The main distribution structure of the T-S graph reconstructed by all methods is similar to that of the Argo field, but the points are more concentrated than those of the Argo field. The T-S graphs also show that the reconstruction is generally weaker for points with larger absolute values of salinity anomaly. To better visualize the predicted distribution of temperature and salinity anomalies, a histogram of the number of salinity anomalies in different intervals is displayed at the top of each T-S graph, and a histogram of temperature anomalies is displayed on the right side. The histogram of salinity anomalies reveals that the distribution of LSTM-POD is the most consistent with that of Argo, while LSTM underestimates the number of salinity anomalies in the interval [-0.5,0], which is overestimated by GPR and GPR-POD. The histograms of temperature anomalies indicate that all four methods overestimate the number of temperature anomalies in the interval [0,2.5], but their overall distribution is very similar to that of Argo. In conclusion, the reconstruction methods can predict the main distribution structure of the T-S graph. Although there are slight differences in the distribution range and the thermohaline state may be underestimated, the performance of the proposed reconstruction methods is generally satisfactory. Furthermore, Figure 7 displays the scatter plots of the LSTM-estimated and Argo temperature and salinity anomalies at a depth of 105 m.
This depth is significantly impacted by the variation of the upper mixed layer, and the RMSE estimated for temperature and salinity at this depth is also large. The results indicate that the overall discrepancy between them is minimal, as the LSTM prediction preserves the fundamental characteristics of the temperature and salinity anomalies, and its pattern (positive or negative) is accurately reconstructed.

Figure 6 T-S graph and histogram of temperature and salinity anomalies for different models. (A) T-S graph for Argo (left and bottom), histogram of the number of salinity anomalies in different intervals for Argo (top), histogram of the number of temperature anomalies in different intervals for Argo (right). Panels (B-E) are the same as panel (A), but for (B) LSTM, (C) GPR, (D) LSTM-POD, and (E) GPR-POD.

Figure 7 TAs and SAs of different models at 105 m. (A) LSTM-estimated TAs. (B) Argo TAs. (C) LSTM-estimated SAs. (D) Argo SAs.

To further assess how POD improves computational efficiency, we compare the running times of the different models. The CPU times of LSTM and LSTM-POD are 11216 s and 1488 s, respectively, measured on a laptop with 4 Intel(R) Core(TM) i7-6770 CPU @ 3.40GHz and 8 GB of memory. Since GPR requires more memory for matrix operations, GPR and GPR-POD are run on a single node with a single core of Tianhe-2, which uses an Intel Xeon E5-2692 v2 CPU @ 2.20GHz and 64 GB of memory. The running times of GPR and GPR-POD are 891 s and 788 s, respectively. The training time of LSTM is thus significantly reduced by combining it with POD, as LSTM takes about 7.5 times longer than LSTM-POD. However, this reduction is not apparent in GPR. This is attributed to the fact that the computational complexity of GPR-POD is $O(kn^{3})$, while that of GPR is $O(2mn^{3})$. In this case, $n=9099$, $k=14$, and $m=71$; the improvement in training time is insignificant when the number of training samples is much larger than the number of output dimensions.
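As a quick check of the timing comparison, the speed-up factors can be recomputed from the reported CPU times. The snippet below is a minimal sketch; the dictionary and function names are illustrative and are not taken from the authors' code. Because the LSTM and GPR timings were measured on different machines, only the ratio within each pair is meaningful.

```python
# Reported CPU times in seconds (values quoted in the text).
cpu_time_s = {
    "LSTM": 11216.0,
    "LSTM-POD": 1488.0,
    "GPR": 891.0,
    "GPR-POD": 788.0,
}


def speedup(baseline, reduced, times=cpu_time_s):
    """Return how many times longer `baseline` runs than `reduced`."""
    return times[baseline] / times[reduced]


lstm_speedup = speedup("LSTM", "LSTM-POD")
gpr_speedup = speedup("GPR", "GPR-POD")

print(f"LSTM vs LSTM-POD: {lstm_speedup:.2f}x")
print(f"GPR  vs GPR-POD:  {gpr_speedup:.2f}x")
```

The first ratio is roughly 7.5, matching the figure quoted for LSTM, while the second is only about 1.13, which is why the gain for GPR is described as not apparent.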

4.2 Sensitivity of different parameters in GPR-POD

To test the sensitivity of the estimated TA and SA to different input parameters, we compare GPR-POD estimates obtained with different training inputs. Figure 8 shows the RMSEs of temperature and salinity for the various inputs, as well as the correlations among the test input parameters, between the test input parameters and the test TA profiles, and between the test input parameters and the test SA profiles. The RMSE of temperature for the model without SSHA in the input is large at all depths, which indicates that SSHA plays a vital role in ocean motion and processes. In fact, changes in the subsurface layer may give rise to changes in SSH through the interaction of several factors, such as heat exchange, internal thermal expansion, and ocean circulation. A rise in sea temperature leads to an increase in SSH, while a decrease in sea temperature leads to a reduction in SSH. The correlation coefficients between SSHA and TA profiles also affirm the strong association between SSHA and TA. Notably, the correlation coefficient between SSHA and TA is the highest among all input parameters except SSTA. In addition, SSH changes under the influence of wave and wind shear. Incorporating SSHA into the model input can therefore provide more information for predicting TA. Latitude, longitude, and time all improve the TA profile reconstruction to different degrees, showing that geographic and temporal information is helpful in predicting TA. In particular, incorporating latitude yields a more substantial reduction in RMSE than incorporating longitude, suggesting that latitude has a greater impact on TA reconstruction. This is because the difference in solar radiation across latitudes results in significant variations in sea temperature in the north-south direction. Latitude information is therefore essential to improve the performance of subsurface reconstruction. Furthermore, the correlation between latitude and TA profiles is higher than that between longitude and TA profiles, highlighting the importance of latitude when predicting TA. The model without SSTA in the input has the largest RMSE in the upper layer (<100 m), but at depths greater than 100 m there is almost no difference between the model without SSTA and GPR-POD (all parameters as input). This suggests that SSTA data mainly improve the reconstruction of the upper layer TA, while deeper temperature variations are difficult to infer from satellite measurements. In addition, SSSA has no significant relationship with ocean interior temperature, which can also be verified from the correlation between SSSA and TA profiles.

Similar results can be observed in the RMSEs of salinity, where adding both latitude and SSHA data reduces the RMSEs of salinity at all depths. Because salinity and temperature together determine seawater density, which in turn affects SSH, SSHA and SA profiles are correlated to some degree. Latitude also influences ocean salinity through a combination of factors, including precipitation, evaporation, and mixing of water masses. Consequently, incorporating latitude and SSHA into the prediction of subsurface salinity can significantly improve model accuracy. SSSA data play an important role in the reconstruction of upper ocean SA profiles but do not help much in the reconstruction of deep SA profiles, and the correlation between SSSA and SA gradually decreases with increasing depth. The RMSEs of the remaining models show that longitude, time, and SSTA bring only limited improvements: longitude and time can reduce the maximum salinity RMSE, while SSTA brings no significant improvement to the reconstruction of SA profiles.
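The depth-wise correlation analysis described in this section can be sketched as follows. This is a minimal illustration that assumes the Pearson coefficient is the correlation measure (the text does not name it) and uses made-up numbers; `pearson` and `correlation_by_depth` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_depth(surface_values, profiles):
    """Correlate one surface parameter with the anomaly at every depth level.

    surface_values: one value per profile (for example SSHA).
    profiles: list of profiles, each a list of anomalies ordered by depth.
    """
    n_depths = len(profiles[0])
    return [
        pearson(surface_values, [p[d] for p in profiles])
        for d in range(n_depths)
    ]


# Toy data (illustrative only): 4 profiles, 2 depth levels.
ssha = [0.1, 0.2, 0.3, 0.4]
ta_profiles = [[1.0, 0.5], [2.0, 0.1], [3.0, 0.4], [4.0, 0.2]]
r = correlation_by_depth(ssha, ta_profiles)
```

Repeating this for every input parameter and stacking the results gives a matrix of the kind that a correlation heat map displays.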

Figure 8 (A) RMSEs of temperature from models with different sea surface inputs. (B) RMSEs of salinity from models with different sea surface inputs. (C) Heat map of correlation coefficients.

4.3 The performance of climatology on three-dimensional salinity and temperature fields reconstruction

To test the influence of climatology on the temperature and salinity reconstruction models, we consider three different sets of inputs and outputs for LSTM and GPR. The inputs are 1) X: SSTA, SSSA, SSHA, LAT, LON, JULD; 2) X-Without-WOA: SST, SSS, SSHA, LAT, LON, JULD; and 3) X-WOA: SST, SSS, SSHA, LAT, LON, JULD, SS (climatology), ST (climatology). The corresponding outputs are 1) X: STA, SSA; 2) X-Without-WOA: SS, ST; and 3) X-WOA: SS, ST. The RMSEs estimated by the different models are shown in Figure 9. The two regression models behave similarly across the various inputs and outputs. In particular, for temperature at depths of less than 100 m and salinity at depths of less than 200 m, training the models with the temperature and salinity anomaly fields gives significantly better prediction accuracy than directly using temperature and salinity as the input and output of the model. This is because the seasonal variation signal disappears once the anomaly field is calculated using the monthly mean of climatology, which makes the anomalies easier to predict. On the other hand, incorporating the monthly temperature and salinity climatology fields into the model input gives the most accurate estimates. Climatological information added to the input can effectively reduce the impact of errors in the sea surface parameters. In summary, seasonal variations in seawater temperature and salinity play a crucial role in oceanic processes, and the information they provide can help us better estimate the temperature and salinity structure within the ocean.
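The anomaly computation referred to here, in which the monthly climatological mean is subtracted from each observation, can be sketched as below. This is a minimal illustration with made-up values; the function names `to_anomaly` and `from_anomaly` and the dictionary layout are assumptions rather than the authors' implementation.

```python
def to_anomaly(value, month, monthly_climatology):
    """Subtract the climatological mean of the given month (1-12)."""
    return value - monthly_climatology[month]


def from_anomaly(anomaly, month, monthly_climatology):
    """Invert to_anomaly: add the climatological mean back."""
    return anomaly + monthly_climatology[month]


# Hypothetical monthly mean temperature (deg C) at one location and depth.
climatology = {1: 18.0, 2: 17.5, 3: 18.2, 4: 19.6, 5: 21.4, 6: 23.8,
               7: 26.1, 8: 27.3, 9: 26.4, 10: 24.2, 11: 21.7, 12: 19.3}

# Hypothetical observations as (month, observed temperature) pairs.
observations = [(1, 18.6), (7, 25.5), (8, 27.9)]

# The seasonal cycle is removed: raw values span about 9 deg C,
# while the anomalies all lie within about 0.6 deg C of zero.
anomalies = [to_anomaly(t, m, climatology) for m, t in observations]
```

A model trained on anomalies predicts an anomaly, so `from_anomaly` is what turns its output back into an absolute temperature or salinity.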

Figure 9 The RMSEs of temperature and salinity from models with different inputs and outputs. (A) RMSEs of estimated temperature. (B) RMSEs of estimated salinity.

Comparing LSTM and GPR with the same input and output, the accuracy of GPR with temperature and salinity climatology as input is higher than that of LSTM. This is because including the monthly climatological fields of temperature and salinity in the input makes the relationship between input and output less complex. Similar to the findings in Section 4.1, GPR can provide more accurate predictions when the amount of training data is not abundant. In contrast, GPR without climatology performs the worst: it is up to 20% worse (at ∼245 m) than the GPR-WOA estimate. When only sea surface parameters are included in the input, without climatology as an aid, the relationship between the input and output becomes more complex. The LSTM can better approximate such a strongly nonlinear function, so GPR-Without-WOA does not perform as well as LSTM-Without-WOA.

4.4 Estimation of temperature and salinity using continuous remote sensing data from SSHA and SSTA

Because the SSS data are provided on a 4-day temporal grid, we leave them out in this section in order to make full use of the other daily satellite and in situ observations: the LSTM model (named LSTM-Daily) is trained with SSTA, SSHA, LAT, LON, and JULD as inputs to estimate the TA and SA profiles. As a result, 45,031 Argo observation profiles are available; again, 80% are randomly selected as the training set, and the remaining 20% are used to test the model's performance. Figure 10 shows the RMSEs of the temperature and salinity estimated by LSTM-Daily, together with the RMSE reduction rate of LSTM relative to the climatological estimates. The RMSE of the estimated TA profiles is clearly reduced by 20% to 70% compared to climatology. By increasing the training data size, a more accurate reconstruction of the temperature profiles can be obtained. On the other hand, the prediction accuracy for salinity decreases at depths of less than 100 m due to the lack of SSS information, which again confirms the importance of SSSA for SA reconstruction in this depth range. At depths greater than 100 m, however, where the effect of SSSA data disappears, higher accuracy in SA estimation is again achieved thanks to the increased training data size.
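The RMSE reduction relative to climatology can be illustrated with the sketch below. It assumes the reduction rate is defined as the relative decrease in RMSE with respect to the climatological estimate, expressed in percent; the text does not spell out the formula, and the numbers are made up. The climatological estimate of an anomaly is zero by construction, since anomalies are defined relative to the climatology.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_reduction_percent(rmse_model, rmse_reference):
    """Relative RMSE decrease of the model with respect to a reference."""
    return 100.0 * (rmse_reference - rmse_model) / rmse_reference


# Made-up temperature anomalies (deg C) at one depth level.
observed_ta = [1.0, -2.0, 0.5, 1.5]
model_ta = [0.8, -1.5, 0.3, 1.0]
climatology_ta = [0.0] * len(observed_ta)  # climatology predicts zero anomaly

rmse_model = rmse(model_ta, observed_ta)
rmse_clim = rmse(climatology_ta, observed_ta)
reduction = rmse_reduction_percent(rmse_model, rmse_clim)
```

Evaluating `reduction` separately at each depth level would give a profile of the kind shown in the reduction panels of the figure; a negative value would mean the model is worse than climatology at that depth.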

Figure 10 The RMSEs of estimated temperature and salinity, and the reduction of RMSE for LSTM reconstructed profiles relative to climatological profiles. (A) RMSEs of estimated temperature. (B) RMSEs of estimated salinity. (C) The reduction of temperature’s RMSE. (D) The reduction of salinity’s RMSE.

5 Discussion

This paper applies several methods to estimate the TA and SA profiles from sea surface parameters, namely LSTM, GPR, LSTM-POD, and GPR-POD. LSTM and GPR directly train the model with TAs and SAs at different depths as the output. LSTM-POD and GPR-POD combine LSTM and GPR, respectively, with POD, which reduces the dimensionality of the data under the assumption that only a small number of modes are needed to represent the main features of the temperature and salinity anomaly fields. Thus, the model only needs to be trained to estimate the coefficients, i.e., the output of the model is the set of reduced coefficients corresponding to the temperature-salinity anomaly profiles, which greatly simplifies the regression model. In addition, using the predicted coefficients, a linear combination of the interpolated modes can estimate TA and SA at any depth. We selected Argo observations and satellite remote sensing data for the Northwest Pacific Ocean in 2011-2021 for training and testing. The accuracy and reliability of the proposed methods are evaluated by calculating the RMSEs of the estimated profiles. The results show that these methods can accurately derive the TA and SA profiles, and that the introduction of POD greatly saves time and storage costs without additional loss of accuracy, especially for LSTM.
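The POD reduction summarized in this paragraph can be sketched with a singular value decomposition. The code below is a minimal illustration on synthetic data, not the authors' implementation: the function names are hypothetical, the profiles are not mean-centered (an assumption), and the sizes of 71 depth levels and 14 modes merely echo the values of $m$ and $k$ quoted in Section 4.1. In the actual method the coefficients would be predicted by LSTM or GPR from sea surface parameters rather than obtained by projection.

```python
import numpy as np


def pod_fit(profiles, n_modes):
    """Compute POD modes of an (n_samples, n_depths) anomaly matrix via SVD."""
    _, _, vt = np.linalg.svd(profiles, full_matrices=False)
    return vt[:n_modes]  # (n_modes, n_depths), rows are orthonormal


def pod_project(profiles, modes):
    """Reduced coefficients of each profile (the regression targets)."""
    return profiles @ modes.T  # (n_samples, n_modes)


def pod_reconstruct(coeffs, modes, depths, query_depths=None):
    """Linear combination of modes, optionally interpolated to new depths."""
    if query_depths is not None:
        modes = np.stack([np.interp(query_depths, depths, m) for m in modes])
    return coeffs @ modes


# Synthetic stand-in for anomaly profiles: 200 profiles, 71 depth levels,
# built from 3 smooth vertical structures so that a few modes suffice.
rng = np.random.default_rng(0)
depths = np.linspace(0.0, 1000.0, 71)
structures = np.stack([np.exp(-depths / s) for s in (50.0, 200.0, 600.0)])
profiles = rng.normal(size=(200, 3)) @ structures

modes = pod_fit(profiles, n_modes=14)
coeffs = pod_project(profiles, modes)
recon = pod_reconstruct(coeffs, modes, depths)
at_105m = pod_reconstruct(coeffs, modes, depths, query_depths=np.array([105.0]))
```

With this layout the regression target shrinks from 71 values per variable to 14 coefficients per profile, and `at_105m` shows how interpolating the modes gives an estimate at a depth that is not on the original grid.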

To determine the relative importance of the different input parameters for the temperature and salinity reconstruction, we evaluate the results of several models with different inputs and calculate the correlations between the different parameters and the TA and SA profiles. The results show that the most significant improvements are obtained by including SSHA and latitude. SSTA and SSSA, on the other hand, play an important role in the TA and SA reconstruction in the upper layers (<100 m), but this role decreases with increasing depth. In addition, using the monthly climatological fields of temperature and salinity, especially when they are added to the input of the regression model, yields fairly good profile predictions for both LSTM and GPR. This suggests that the introduction of climatology can effectively reduce the effect of sea surface errors and provide more information to the regression model, thus improving the accuracy and robustness of the model. It also encourages us to explore multiple satellite measurements further to improve the reliability of the estimates in the future.

In general, the proposed methods can be effectively used for reconstructing temperature and salinity profiles, particularly when the monthly climatology of temperature and salinity is included as an input to the GPR. The techniques presented in this article for estimating subsurface temperature and salinity do not require any prior knowledge or assumptions and are highly versatile and general. The models can accurately predict new subsurface temperature and salinity values as long as they are well trained. The proposed methods are expected to be beneficial for detecting the thermal structure of the ocean interior in marine science and climate change research, as well as for more precise analysis of temperature and salinity changes. However, it is important to note that, unlike dynamics-based reconstruction methods, the proposed methods are based solely on data and lack physical interpretation, which may be a limitation in the development of artificial intelligence approaches to oceanography. With the rapid advancement of ocean modeling capability, observation technology, and artificial intelligence, a promising direction for future work is to combine the advantages of dynamics-based and data-driven approaches, enhance both the predictive ability and the physical interpretability of the models, and establish a deep learning model informed by physics.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

YC: Methodology, Investigation, Performed Experiments, Writing-Original Draft. LL: Data Collection and Curation, Writing-Review. XC: Formal Analysis, Writing-Review and Editing. ZW: Supervision, Writing-Review. XS: Writing-Review and Editing. CY: Writing-Review and Editing. ZG: Writing-Review, Project Administration, Funding Acquisition. All authors contributed to the article and approved the submitted version.

Funding

This study was supported by the National Key Research and Development Program of China (2021YFF0704000), the Taishan Scholars Program (tsqn202211059) and the Fundamental Research Funds for the Central Universities (202265005, 202264007).

Acknowledgments

The authors would like to thank the reviewers for their valuable suggestions on the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer XL declared a shared affiliation with the authors to the handling editor at the time of review.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
