Application of machine learning algorithms for prediction of ultraviolet absorption spectra of chromophoric dissolved organic matter (CDOM) in seawater

Ju, Aobo; Wang, Hu; Wang, Lequan; Weng, Yuang

doi:10.3389/fmars.2023.1065123

ORIGINAL RESEARCH article

Front. Mar. Sci., 30 January 2023

Sec. Marine Biogeochemistry

Volume 10 - 2023 | https://doi.org/10.3389/fmars.2023.1065123

Aobo Ju

Hu Wang^*

Lequan Wang

Yuang Weng

State Key Laboratory of Marine Geology, Tongji University, Shanghai, China

The ultraviolet absorption spectra of chromophoric dissolved organic matter (CDOM) can be used to trace its sources and to explore the dynamics of the CDOM pool. In previous studies, only the spectra above 240 nm could be used directly to characterize the CDOM in seawater, because the CDOM absorption spectra below 240 nm overlap with those of inorganic chemicals such as ${NO}_{3}^{-}$ , ${NO}_{2}^{-}$ , Cl^- and Br^-. In this study, three machine learning models, back propagation neural network (BPNN), random forest (RF) and extreme gradient boosting (XGBoost), were built to predict the CDOM ultraviolet absorption spectra between 215 and 350 nm after being trained with the raw absorption spectra of seawater. The optimal input wavelength range of the raw seawater spectra was 250-350 nm, and the optimal model parameters of the machine learning algorithms were determined by five-fold cross validation. The results show that all three models predict the CDOM absorption spectra well, with the XGBoost model giving the best prediction results. This might be because the XGBoost algorithm focuses on the residuals generated by the last iteration, which can reduce both variance and bias, especially for datasets with small sample sizes. Based on the spectra predicted by the XGBoost algorithm, we calculated the spectral slopes at short wavelengths between 215 and 240 nm (S_215-240) and between 215 and 275 nm (S_215-275). The results show that S_215-240 and S_215-275 are ~2 times the widely used spectral slope between 275 and 295 nm (S_275-295) obtained by the traditional method based on the raw spectra. Moreover, S_215-240 and S_215-275 are more strongly correlated with salinity for marine CDOM than S_275-295, suggesting that spectral slopes at shorter wavelengths might be better proxies for marine CDOM than those at longer wavelengths.

1 Introduction

Chromophoric dissolved organic matter (CDOM), which is also called yellow substance, exists widely in oceans, lakes and rivers. It plays a key role in climate-related biogeochemical cycles in aquatic ecosystems, such as carbon dynamics, phytoplankton activity, microbial growth and ecosystem productivity (Nelson and Siegel, 2013; Stedmon and Nelson, 2015). CDOM is a soluble and complex mixture of many kinds of organic substances, including humic acid, fulvic acid and aromatic polymers (Li and Hur, 2017; Zhang et al., 2021), and constitutes a significant fraction of the DOM pool in natural waters (10-90%) (Twardowski et al., 2004). CDOM absorbs both ultraviolet and visible (UV-Vis) light, and it is well known that the optical properties of CDOM in seawater can be used to trace its sources and to explore the dynamics of the CDOM pool (Whitehead et al., 2000; McKnight et al., 2001; Stedmon and Markager, 2001; Baker and Spencer, 2004; Guo et al., 2007; Yang et al., 2013; Yamashita et al., 2013; Jørgensen et al., 2014). However, due to the complexity of CDOM composition, it is difficult to link the optical absorbance directly to CDOM concentration or chemical composition (Del Castillo and Coble, 2000; Zhao et al., 2018; Nima et al., 2019). Since the UV-Vis absorption spectra of CDOM decrease approximately exponentially with increasing wavelength, exponential models are generally used to describe CDOM absorption spectra (Stedmon and Markager, 2001; Twardowski et al., 2004; Helms et al., 2008; Li and Hur, 2017). The most often used model is given in Equation (1).

\begin{array}{l} A_{CDOM} (λ) = A_{CDOM} (λ_{0}) e^{S (λ_{0} - λ)} + k & (1) \end{array}

where λ is the wavelength (nm), λ₀ is a reference wavelength (nm), A_CDOM(λ) and A_CDOM(λ₀) are the CDOM absorbance at the wavelength of λ and λ₀, k is a background constant (m^-1), S is the spectral slope (nm^-1) that describes the approximate exponential rate of decrease in absorption with increasing wavelengths.
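As a minimal sketch of Equation (1), the function below evaluates the exponential model at a few wavelengths. The parameter values (`A_REF`, `S`, `REF_NM`) are hypothetical and chosen only for illustration; they are not taken from the samples in this study.

```python
import math

def cdom_absorbance(wavelength_nm, a_ref, s, ref_nm, k=0.0):
    """Evaluate Equation (1): A(lambda) = A(lambda0) * exp(S * (lambda0 - lambda)) + k."""
    return a_ref * math.exp(s * (ref_nm - wavelength_nm)) + k

# Hypothetical parameters, for illustration only.
A_REF = 0.05    # absorbance at the reference wavelength
S = 0.02        # spectral slope, nm^-1
REF_NM = 295.0  # reference wavelength, nm

# Absorbance decreases with increasing wavelength and equals A_REF at REF_NM.
spectrum = {wl: cdom_absorbance(wl, A_REF, S, REF_NM) for wl in (275, 295, 350)}
```

A larger S gives a steeper decay, which is why S is used to compare the shapes of CDOM spectra.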

The S, k and Equation (1) for characterizing different CDOM are usually obtained over wavelength ranges of > 275 nm (e.g., 275-295, 350-400 and 300-600 nm) (Twardowski et al., 2004; Li and Hur, 2017). Only recently, Massicotte and Markager (2016) used a Gaussian decomposition approach to model CDOM absorption spectra between 240 and 700 nm, which can remove the errors associated with the choice of the spectral range used to estimate S. However, the spectra below 240 nm cannot be modelled directly using Equation (1), because several inorganic ions in seawater, including ${NO}_{3}^{-}, {NO}_{2}^{-}$ , Cl^- and Br^-, absorb strongly between 190 and 250 nm (Figure 1) and thus overlap with the CDOM absorbance (Guenther et al., 2001; Johnson and Coletti, 2002). Conversely, when ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ are measured by ultraviolet spectroscopy, CDOM interferes with the results (Armstrong, 1963; Johnson and Coletti, 2002; Sakamoto et al., 2009). Therefore, unraveling the CDOM UV absorbance below 240 nm can improve the understanding of CDOM light absorbance characteristics and help to determine ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ concentrations in seawater accurately when using spectroscopic techniques.

Figure 1 Absorption spectra of ${NO}_{3}^{-}, {NO}_{2}^{-}$ , and sea salt (salinity) in seawater.

Machine learning algorithms can cope with nonlinearity and other complex regression problems (Verrelst et al., 2012). In the last decade, machine learning techniques have been increasingly employed to estimate CDOM abundance and to trace its sources and reactivity. However, most of these studies built machine learning models based on CDOM fluorescence spectroscopy (Stedmon et al., 2003; Stedmon and Bro, 2008; Murphy et al., 2008; Nelson and Gauglitz, 2016; Murphy et al., 2018; Sun et al., 2022), which can provide rich information with its three-dimensional data (i.e. excitation, emission and intensity) (Stedmon et al., 2003; Stedmon and Bro, 2008; Coble et al., 2014; Murphy et al., 2018; Marín-García and Tauler, 2020). In addition, many scholars have developed algorithms based on remotely sensed reflectance to characterize CDOM (Cao and Miller, 2015; Ruescas et al., 2018; Zhao et al., 2018), although it is difficult to obtain an accurate estimation of CDOM from satellite data due to its low optical signals and absorption spectral shapes that are similar to those of nonphytoplankton particulate matter (Zhang et al., 2021). To date, there are no reports on CDOM UV-Vis spectra below 240 nm combined with machine learning models.

In this work, we aim to model the UV-Vis absorption spectra of CDOM in seawater between 215 and 350 nm by machine learning models based on the raw absorption spectra of seawater. Three machine learning algorithms, back propagation neural network (BPNN), random forest (RF) and extreme gradient boosting (XGBoost), were implemented to establish the prediction models. The optimal input wavelength range and model parameters were selected and the results from the three algorithms were evaluated and compared.

2 Materials and methods

2.1 Apparatus

A UV-Vis spectrophotometer (Specord plus 210, Analytik Jena AG, Germany) was used to collect absorption spectra of seawater from 200 to 350 nm. All the samples were measured in a 3.0 cm quartz cuvette with the spectral resolution set to 0.2 nm. The scripts for the XGBoost, RF and BPNN algorithms were written in Python.

2.2 Data preprocessing

2.2.1 Dataset

Water samples with different ${NO}_{3}^{-}$ , ${NO}_{2}^{-}$ and CDOM concentrations and salinities were collected from the Changjiang River Estuary and East China Sea. These samples were split into a training set and a test set at a ratio of 2:1, and 20% of the training set samples were randomly taken as the validation set. Notably, the splitting ratio of the training and test sets can also be 3:1, 4:1, etc., depending on the number of samples. When selecting the samples, care was taken to include one-, two- and three-component combinations of ${NO}_{3}^{-}$ , ${NO}_{2}^{-}$ and salinity at various concentrations in the training set so that the built models have better prediction performance (Mitchell, 1997; Quinonero-Candela et al., 2008). Hence, several natural seawater samples were diluted with Milli-Q water or spiked with standard ${NO}_{2}^{-}$ solutions, considering the very low ${NO}_{2}^{-}$ concentrations in the samples compared with ${NO}_{3}^{-}$ (Table 1). The resulting ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ concentrations and salinities in the training set ranged from 0 to 85.62 μM, 0 to 14.60 μM and 0 to 35.42 PSU (practical salinity units), respectively (Table 1), which cover their ranges in the Changjiang River Estuary and East China Sea.
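The 2:1 split with a 20% validation subset can be sketched as follows. This is only an illustration under stated assumptions: the sample count of 60 and the function name `split_samples` are hypothetical, and the validation samples are assumed to be held out of the remaining training samples, which the text does not specify.

```python
import random

def split_samples(sample_ids, test_fraction=1 / 3, val_fraction=0.2, seed=0):
    """Shuffle sample IDs, then split them into training, validation and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)      # 2:1 ratio -> one third for testing
    test, train_full = ids[:n_test], ids[n_test:]
    n_val = round(len(train_full) * val_fraction)  # 20% of the training samples
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

# Hypothetical example with 60 samples.
train_ids, val_ids, test_ids = split_samples(range(60))
```

Changing `test_fraction` to 1/4 or 1/5 gives the 3:1 and 4:1 ratios mentioned above.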

Table 1 Samples of the training and test sets.

2.2.2 Calculation of the theoretical CDOM absorption spectra

In seawater, the main inorganic and organic chemical substances absorbing UV light include ${NO}_{3}^{-}, {NO}_{2}^{-}$ , salinity and CDOM. As a result, the CDOM absorbance can be obtained as the difference between the total seawater absorbance (A_λ) and the absorbance of ${NO}_{3}^{-}, {NO}_{2}^{-}$ and salinity ( $A_{{NO}_{3}^{-}}$ , $A_{{NO}_{2}^{-}}$ , A_salinity ), as shown in Equation (2). Based on the Beer-Lambert law, Equation (2) can be rewritten as Equation (3).

\begin{array}{l} A_{CDOM, λ} = A_{λ} - (A_{{NO}_{3}^{-}, λ} + A_{{NO}_{2}^{-}, λ} + A_{salinity, λ}) & (2) \end{array}

\begin{array}{l} A_{CDOM, λ} = A_{λ} - b \times (ϵ_{{NO}_{3}^{-}, λ} \times C_{{NO}_{3}^{-}} + ϵ_{{NO}_{2}^{-}, λ} \times C_{{NO}_{2}^{-}} + ϵ_{salinity, λ} \times salinity) & (3) \end{array}

where b is the path length (cm) of the optical cell, ϵ is the absorption coefficient of the subscripted species (l mol^-1 m^-1 for ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ , PSU^-1 m^-1 for salinity), and C is the concentration of the subscripted species. Each ϵ value can be obtained by measuring the absorbance of standard solutions with known concentrations. ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ concentrations were measured by conventional wet-chemical analyses (colorimetric Griess assay) using an AA3 Auto-Analyzer (Bran Luebbe Co., Germany). ${NO}_{2}^{-}$ was determined using the pink azo dye spectrophotometric method at a wavelength of 543 nm. ${NO}_{3}^{-}$ was first reduced to ${NO}_{2}^{-}$ using a cadmium column before measurement. Salinity values of the samples were obtained from an onboard in-situ conductivity-temperature-depth (CTD) recorder (SBE911, Sea-Bird Co., USA).
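Equation (3) amounts to a wavelength-by-wavelength subtraction, which might be sketched as below. The function name `cdom_spectrum` and all numerical values are hypothetical, and the sketch assumes that the absorption coefficients, concentrations and path length are supplied in mutually consistent units.

```python
def cdom_spectrum(total_abs, eps_no3, eps_no2, eps_sal,
                  c_no3, c_no2, salinity, path_length):
    """Apply Equation (3) at each wavelength.

    total_abs, eps_no3, eps_no2 and eps_sal are lists sampled on the same
    wavelength grid; c_no3, c_no2 and salinity are scalars for one sample.
    """
    return [
        a - path_length * (e3 * c_no3 + e2 * c_no2 + es * salinity)
        for a, e3, e2, es in zip(total_abs, eps_no3, eps_no2, eps_sal)
    ]

# Hypothetical two-wavelength example (values are not from this study).
a_cdom = cdom_spectrum(
    total_abs=[0.50, 0.20],
    eps_no3=[0.002, 0.0005],
    eps_no2=[0.001, 0.0002],
    eps_sal=[0.001, 0.0001],
    c_no3=20.0, c_no2=2.0, salinity=30.0, path_length=3.0,
)
```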

2.3 Machine learning algorithms

2.3.1 BPNN

BPNN is a multi-layer feedforward artificial neural network trained by the error back propagation algorithm (Rumelhart and McClelland, 1986; Zhou and Li, 2020). It consists of an input layer, a hidden layer and an output layer. In BPNN, because the neurons in adjacent layers are inter-connected, each layer is also called a fully connected (FC) layer, which can combine the features from the previous layer (Hecht-Nielsen, 1992; Erb, 1993; Li et al., 2012; Tawfik et al., 2018). BPNN uses activation functions to accomplish nonlinear data transformation; they are applied to the output of each FC layer and allow the network to create arbitrary nonlinear mappings between inputs and outputs. The commonly used activation functions include sigmoid and ReLU, which are shown in Equations (4) and (5).

\begin{array}{l} f (x) = σ (x) = \frac{1}{1 + e^{- x}} & (4) \end{array}

\begin{array}{l} f (x) = ReLU (x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} & (5) \end{array}
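Equations (4) and (5) can be written directly in Python. The sketch below also includes a hypothetical helper, `dense`, that applies an activation to one fully connected layer; the weights and inputs are illustrative and are not the trained values of this study.

```python
import math

def sigmoid(x):
    """Equation (4), written so that large negative x does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relu(x):
    """Equation (5): 0 for x < 0, x otherwise."""
    return x if x >= 0 else 0.0

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical layer with two inputs and two neurons.
hidden = dense([1.0, -2.0], [[1.0, 1.0], [0.5, 0.0]], [0.0, 0.0], relu)
```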

The BPNN model is trained by continuously adjusting the weights and thresholds of each neuron, and the training process consists of forward propagation and backward feedback. The former transmits the output values layer by layer, while the latter sums the error derivatives for the weights in the reverse direction until all the data have been run through the network once (Dong et al., 2020). This constitutes an epoch, and the weights are updated after each epoch such that the model error decreases (Primadusi et al., 2016).

In this paper, the input, output and the theoretical output of BPNN model are respectively x = (x₁, x₂, ···, x_n), y = (y₁, y₂, ···, y_m) and d = (d₁, d₂,···, d_m). The x_n represents the n-th wavelength of the input raw seawater spectra, the y_m and d_m represent the m-th wavelength of the calculated and theoretical output CDOM spectra, respectively.

2.3.2 RF

As an ensemble learning method, RF creates multiple decision trees from random subsets of the original training dataset (Breiman, 2001; Cutler et al., 2012). By averaging the predictions of the individual trees, RF can obtain a more accurate result. The training subset of each tree is generated by a bootstrapping procedure, which divides the training dataset into an “in-bag” subset used to train the decision tree and an “out-of-bag” subset not included in the training process. This partitioning is unique for each tree in the forest and hence provides a significant internal validation. As a result, RF can overcome the disadvantages of overfitting and instability and has good robustness and high interpretability (Khoshgoftaar et al., 2007; Primadusi et al., 2016). More details of the RF algorithm can be found in Breiman’s article (Breiman, 2001).

Here, the samples in the training set are randomly sampled repeatedly by bootstrap resampling technology to generate K sub-training sets, and each sub-training set constructs a regression tree. The prediction result of CDOM spectrum in the i-th seawater sample can be calculated as follows:

\begin{array}{l} y (x_{i}) = \frac{1}{K} \sum_{j = 1}^{K} y (x_{i, j}, θ_{j}) & (6) \end{array}

Where x_{i, j} denotes the input raw seawater spectra of the i-th sample in the j-th sub-training set, θ_j is the random variable of the j-th regression tree, y(x_i,j,θ_j ) represents the predicted CDOM spectra of the j-th regression tree for the i-th sample.
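The bootstrap partition and the averaging in Equation (6) can be illustrated with the sketch below. The helpers `bootstrap_split` and `forest_predict` are hypothetical names, and the "trees" are stand-in functions that return fixed spectra; a real regression tree would be fitted to its in-bag samples.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

def forest_predict(trees, x):
    """Equation (6): average the spectra predicted by the K regression trees."""
    predictions = [tree(x) for tree in trees]
    k = len(predictions)
    # zip(*predictions) groups the K predicted values at each output wavelength.
    return [sum(values) / k for values in zip(*predictions)]

in_bag, out_of_bag = bootstrap_split(10, random.Random(0))

# Stand-in "trees" (K = 3), each returning a two-wavelength spectrum.
trees = [lambda x: [0.30, 0.10], lambda x: [0.40, 0.20], lambda x: [0.50, 0.30]]
averaged = forest_predict(trees, x=[0.1, 0.2])
```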

2.3.3 XGBoost

XGBoost is an improved algorithm based on the gradient boosting decision tree. It was developed to increase computing speed and accuracy, and thus requires less training and prediction time. Instead of averaging independent trees, XGBoost iteratively adds decision trees that are fitted to the prediction errors, or residuals, of the previous tree model until no significant improvement is detected (Abdel-Rahman et al., 2017). Unlike RF, where the decision trees are built in parallel with no interaction between trees, XGBoost generates trees sequentially with constant error correction.

The objective function of the XGBoost algorithm consists of a loss function (L) and a regularization term (Ω) that suppresses the complexity of the model, as shown in Equations (7) and (8). The loss function represents the bias of the model, and the inclusion of the regularization term reduces the variance to prevent overfitting. Both the bias and variance determine the prediction accuracy of the model (Fan et al., 2018). More details can be found in Chen and Guestrin (2016).

\begin{array}{l} Obj = \sum_{i = 1}^{n} L (y_{i}, {\hat{y}}_{i}) + \sum_{i = 1}^{t} Ω (f_{i}) & (7) \end{array}

\begin{array}{l} Ω (f_{t}) = γT + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2} & (8) \end{array}

where Obj is the objective function, L is the loss function term, Ω is the regularization term, ${\hat{y}}_{i}$ is the predicted value of the i-th sample, y_i is the theoretical value of the i-th sample, γ is the penalty coefficient on the number of leaves, which has a pruning effect, T is the number of leaf nodes per tree, λ is the penalty coefficient on the leaf weights, which prevents overfitting, and ω is the leaf weight value.

Here, the input X of the model is a matrix with size N × M, where N is the number of seawater samples and M is the number of input raw seawater spectral wavelengths. The output Y of the model is a matrix with size N × L, where N is the number of seawater samples and L is the number of output CDOM spectral wavelengths. The model is trained with the samples in the training set to minimize Obj in Equation (7).
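To make Equations (7) and (8) concrete, the sketch below evaluates the objective for a toy case. It assumes a squared-error loss, which the text does not specify, and all numbers and function names are hypothetical.

```python
def regularization(leaf_weights, gamma, lam):
    """Equation (8): Omega(f) = gamma * T + 0.5 * lambda * sum of squared leaf weights."""
    t = len(leaf_weights)  # T, the number of leaves in this tree
    return gamma * t + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_pred, trees_leaf_weights, gamma, lam):
    """Equation (7), with a squared-error loss (an assumption of this sketch)."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(regularization(w, gamma, lam) for w in trees_leaf_weights)
    return loss + penalty

# Toy example: two samples and two trees with 2 and 1 leaves.
obj = objective(
    y_true=[0.30, 0.10],
    y_pred=[0.28, 0.11],
    trees_leaf_weights=[[0.1, -0.2], [0.05]],
    gamma=0.1,
    lam=1.0,
)
```

Increasing `gamma` penalizes trees with many leaves, while increasing `lam` shrinks the leaf weights; both make the model simpler.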

2.4 Evaluation of the algorithmic model performance

The prediction accuracy and performance of the different models are evaluated with the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) between the theoretical and the predicted CDOM spectra between 215 and 350 nm. These evaluation metrics are defined in Equations (9)-(11).

\begin{array}{l} R^{2} = \frac{SSR}{SST} = \frac{\sum_{i} {({\hat{y}}_{i} - \bar{y})}^{2}}{\sum_{i} {(y_{i} - \bar{y})}^{2}} & (9) \end{array}

\begin{array}{l} MAE = \frac{1}{N} \sum_{i = 1}^{N} | y_{i} - {\hat{y}}_{i} | & (10) \end{array}

\begin{array}{l} RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - {\hat{y}}_{i})}^{2}} & (11) \end{array}

Where y_i and ${\hat{y}}_{i}$ are the theoretical and predicted absorbance of CDOM in the i-th sample, $\bar{y}$ is the average of the theoretical absorbance of CDOM in those samples, and N is the number of samples.
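The three metrics can be computed as in the sketch below, which follows Equations (9)-(11) as written. Note that R² is implemented as SSR/SST per Equation (9); many libraries instead use 1 − SSE/SST, which can give a different value. The input values are hypothetical.

```python
import math

def r_squared(y_true, y_pred):
    """Equation (9): SSR / SST, both taken about the mean of the theoretical values."""
    mean = sum(y_true) / len(y_true)
    ssr = sum((p - mean) ** 2 for p in y_pred)
    sst = sum((y - mean) ** 2 for y in y_true)
    return ssr / sst

def mae(y_true, y_pred):
    """Equation (10): mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Equation (11): root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical theoretical and predicted absorbances.
theory = [0.10, 0.20, 0.30]
predicted = [0.12, 0.20, 0.28]
scores = (r_squared(theory, predicted), mae(theory, predicted), rmse(theory, predicted))
```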

2.5 Model development

In this study, the BPNN, RF and XGBoost algorithms were each used to establish a spectral prediction model of CDOM. The flow chart of the model development is shown in Figure 2, which includes three steps.

Figure 2 Flowchart of the machine learning models for CDOM spectrum prediction.

Step 1. Data preprocessing. Use the instruments and methods mentioned above to obtain the raw seawater spectra of each sample. Analyze the ${NO}_{3}^{-}$ and ${NO}_{2}^{-}$ concentrations in the samples using the colorimetric Griess assay. Then, calculate the theoretical CDOM spectra using Equation (3).

Step 2. Model construction. In order to avoid overfitting, five-fold cross validation was used to select the optimal model parameters. The absorption spectra of the samples in the training set were used to train the three machine learning models, respectively, and then the validation set samples were evaluated with the evaluation metrics in Section 2.4. In order to obtain the best prediction results, the input wavelength range and model parameters (layers, the node number and training epochs of BPNN, the number of trees of RF and XGBoost, etc.) need to be tuned according to the evaluation results.

Step 3. Model application. Use the built BPNN, RF and XGBoost models to predict the CDOM absorption spectra between 215 and 350 nm for the samples in the test set. The prediction results were evaluated by comparing to the theoretical spectra obtained in Step 1.
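The five-fold cross validation used in Step 2 can be sketched as an index generator. The function name `k_fold_indices` and the sample count of 23 are hypothetical; in practice, each training/validation pair would be used to fit a model and compute the metrics of Section 2.4, and the parameter setting with the best average validation score would be kept.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Hypothetical example with 23 training samples.
splits = list(k_fold_indices(23))
```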

3 Results and discussion

3.1 The theoretical CDOM absorption spectra

The absorption spectra of CDOM between 215 and 350 nm in each sample were calculated based on Equation (3). The absorbance below 215 nm was not calculated because the absorbance was saturated. The results show that all the spectra follow a similar exponential decay, with the absorbance decreasing rapidly from 215 to 240 nm and then decreasing slowly above 240 nm. Comparatively, the coastal water samples with lower salinity had higher CDOM absorbance. For example, train sample 40 (salinity = 15.54) and test sample 19 (salinity = 11.58) had absorbance higher than 0.35 at 215 nm (Figure 3), whereas samples with higher salinity, such as train sample 9 (salinity = 33.23) and test sample 1 (salinity = 31.44), had lower absorbance (Figure 3).

Figure 3 CDOM ultraviolet absorption spectra of seawater samples.

A broad spectral peak between 260 and 275 nm characterized most samples, as also found in previous studies (Guenther et al., 2001; Johnson and Coletti, 2002). The peak was ascribed to a specific kind of organic matter. Notably, sulfides also have an absorbance peak near 260 nm, and an absorbance of > 0.5 has been observed in sediment pore water (Guenther et al., 2001). However, in oxygenated and alkaline seawater, sulfide concentrations are normally very low, so their absorbance can be neglected.

3.2 Selection of wavelength range for model input

Wavelength selection aims to choose the range of the raw seawater spectra with which the established model has the best prediction ability. Including uninformative wavelengths in the training process would reduce the prediction accuracy and model interpretability. Here, we trained the models using the raw seawater absorption spectra with different wavelength ranges between 215 and 350 nm, e.g. 230-350, 240-350, 230-340, 240-340, 250-350 nm, etc. Then, the results were evaluated by calculating the R², MAE and RMSE between the predicted and theoretical CDOM spectra of the validation set. The wavelength interval with maximal R² and minimal MAE and RMSE was selected as the optimal wavelength range. The prediction accuracies of the models using different wavelength ranges are shown in Figure 4. The results suggest that the optimal wavelength range was 250-350 nm for both the BPNN and XGBoost models, which had maximal R² of 0.786 and 0.809, minimal RMSE of 0.0103 and 0.0095 and minimal MAE of 0.0043 and 0.0037, respectively. For the RF model, although 240-350 nm was the optimal wavelength interval, its prediction results (R² = 0.8061, RMSE = 0.0096, MAE = 0.0036) were very close to those of the 250-350 nm range (R² = 0.8040, RMSE = 0.0095, MAE = 0.0037). Hence, 250-350 nm was chosen to train the three models and predict the CDOM spectra between 215 and 350 nm.

Figure 4 The (A) R², (B) RMSE and (C) MAE between the predicted and theoretical absorbance for different wavelength ranges.

3.3 Optimization of model parameters

3.3.1 The number of epochs of the BPNN model

In this study, the BPNN model consisted of three FC layers with nonlinear activation functions. The first two activation functions were ReLU functions, while the third was the sigmoid function. The number of training epochs is another important parameter of the BPNN model. Too few or too many training epochs may cause underfitting or overfitting, respectively, and degrade the prediction performance. The R² and RMSE of the validation set were used to determine the optimal number of training epochs. The BPNN model was trained for 1000 epochs using the samples of the training set, and the resultant R² and RMSE of the training and validation sets are shown in Figure 5A, B. The R² of both the training and validation sets increased with the number of epochs up to about 200 epochs and remained stable thereafter (Figure 5A). Accordingly, the RMSE decreased with increasing epochs until 200 (Figure 5B). Therefore, 200 epochs were selected for training the BPNN model.

Figure 5 Variation of R² (A) and RMSE (B) with the numbers of epoch of the BPNN model. The error bars indicate ± 1 standard deviation.

3.3.2 The number of decision trees of the RF and XGBoost models

In the XGBoost and RF models, the number of trees represents the number of base learners. Fewer trees would lead to poorer model performance and higher prediction error. Since the XGBoost and RF models are relatively robust to over-fitting, the number of trees can be large so that the model generalizes well. However, superfluous trees would increase the complexity and running time of the model.

Similar to BPNN, the R² and RMSE of the training and validation sets were used to determine the optimal numbers of trees for the XGBoost and RF models, which are shown in Figures 6A-D. For the XGBoost model, once the number of trees exceeded 200, the R² was the largest and the RMSE was the lowest for both the training and validation sets. Hence, 200 was chosen as the optimal number of trees. Similarly, for the RF model, 50 was chosen as the optimal number of trees.

Figure 6 Variation of R² and RMSE of the RF (A, B) and XGBoost (C, D) models with different numbers of decision trees. The error bars indicate ± 1 standard deviation.

3.4 Comparison of the results from the built BPNN, RF and XGBoost models

Based on the optimal wavelength range and model parameters, the three machine learning models were trained using the spectra of the training set samples. They were then used to predict the CDOM spectra of the test set samples. Figure 7 shows the predicted CDOM spectra of several samples in the test set, together with the theoretical CDOM spectra for comparison. The results suggest that there was no significant difference between the results from the three models, and that they were consistent with the theoretical CDOM spectra between 215 and 350 nm. Comparatively, the XGBoost model gave the best prediction results, with the highest R² and lowest RMSE and MAE (Figure 4).

Figure 7 The predicted results from the BPNN, RF and XGBoost models. (A) - test sample 4, (B) - test sample 16, (C) - test sample 20, (D) - test sample 8.

Furthermore, we plotted the predicted absorbance at 215, 220 and 240 nm of all the samples in the test set against the theoretical absorbance in Figure 8. The R² values and slopes of the fitted lines can be used to evaluate the correlation and closeness between the predicted and theoretical absorbance. We found that both the XGBoost and RF models had better R² and slopes than the BPNN model at 215, 220 and 240 nm. Although the XGBoost and RF models had similar R² values, the XGBoost model had slopes closer to 1 (0.92, 0.92 and 1.00), indicating that its predicted absorbance fitted the theoretical values more closely. As ensemble learning algorithms, XGBoost and RF can mitigate overfitting by creating multiple decision trees (Breiman, 2001; Chen and Guestrin, 2016). In addition, both XGBoost and RF have been shown to outperform BPNN in prediction performance for training sets with small sample sizes (Luckner et al., 2017; Ogunleye and Wang, 2019; Han et al., 2021). However, the RF algorithm focuses on the final voting results of all decision trees and can only reduce the variance, while the XGBoost algorithm focuses on the residuals generated by the last iteration and can therefore reduce both variance and bias (Oh and Lee, 2017; Zhang et al., 2019). These reasons might explain the best performance of the XGBoost model, especially for data sets with limited samples.

Figure 8 The relationships between the predicted and theoretical absorbance of CDOM at 215, 220 and 240 nm for samples in the test set using BPNN (A-C), RF (D-F) and XGBoost (G-I) algorithms. Red dash lines represent line of 1:1 of predicted to theoretical absorbance. Green solid lines represent fitted line between the predicted and theoretical absorbance.

It should be noted that the method developed here can also be used to predict the UV absorption spectra of seawater samples collected from a variety of marine environments, such as eutrophic or oligotrophic waters. However, we recommend using local training sets, considering that the CDOM composition and light absorbance might vary among different waters.

3.5 The spectral slopes of short wavelengths

The spectral slope, S in Equation (1), is an important parameter describing the shape of UV-Vis spectra and can be used as an indicator of the molecular size, molecular weight and sources of CDOM (Bricaud et al., 1981; Helms et al., 2008; Stedmon and Nelson, 2015). For absorption measurements of CDOM, the main problem is the low absorption at long wavelengths in combination with the limited length of the cuvette and the possible scattering effect of particles and bubbles. The slope at shorter wavelengths can be measured with high precision and is therefore more reliable than the values at longer wavelengths (Markager and Vincent, 2000; Helms et al., 2008). However, calculating S at wavelengths shorter than 240 nm from the raw spectra is problematic because of the interference of substances other than CDOM.

Generally, the widely used spectral slope is calculated between 275 and 295 nm by the traditional method (S_275-295T), which is based on the raw spectra and employs non-linear regression of Equation (1). For comparison, we used the spectra predicted by the XGBoost algorithm to obtain the spectral slopes of the test set samples between 215 and 240 nm (S_215-240) and between 215 and 275 nm (S_215-275). The reference wavelength was set at 295 nm. The results are shown in Table 2. S_215-240 and S_215-275 have similar values, ranging from 0.030 to 0.066 nm^-1 and 0.024 to 0.044 nm^-1, respectively, which are almost twice the S_275-295T (0.015 to 0.035 nm^-1). This result is consistent with previous observations indicating increasing S values with decreasing wavelengths (Twardowski et al., 2004; Swan et al., 2013; Wei et al., 2016).
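As a rough illustration of how a spectral slope can be estimated over a chosen wavelength window, the sketch below fits a straight line to the natural logarithm of the absorbance. This log-linear fit is a simplification that assumes k = 0 in Equation (1); it is not the non-linear regression described above. With this fit, the estimated S does not depend on the reference wavelength. The synthetic spectrum and the name `spectral_slope` are hypothetical.

```python
import math

def spectral_slope(wavelengths, absorbance, lo, hi):
    """Estimate S (nm^-1) between lo and hi nm from a least-squares fit of ln(A) vs wavelength."""
    pts = [(wl, math.log(a)) for wl, a in zip(wavelengths, absorbance)
           if lo <= wl <= hi and a > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    # ln(A) = ln(A0) + S * (lambda0 - lambda), so the fitted slope equals -S.
    return -sxy / sxx

# Synthetic spectrum with a known slope of 0.04 nm^-1 (illustration only).
wavelengths = list(range(215, 351, 5))
absorbance = [0.05 * math.exp(0.04 * (295 - wl)) for wl in wavelengths]
s_215_240 = spectral_slope(wavelengths, absorbance, 215, 240)
```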

Table 2 Comparison of S_215-240 and S_215-275 based on the predicted spectra and S_275-295T calculated using the traditional method.

The relationships between S and salinity are plotted in Figure 9. All the S values show a similar distribution. For nearshore samples with lower salinities (<27), there is no large difference in S, suggesting that these samples have similar CDOM composition. However, for marine samples with comparatively higher salinities (>27), S tends to increase with increasing salinity. Previous efforts have demonstrated that S correlates strongly with molecular weight and size (Helms et al., 2008; Stedmon and Nelson, 2015). Low molecular weight CDOM has stronger absorbance at shorter wavelengths (<300 nm), causing higher (or steeper) spectral slopes, and vice versa (Stedmon et al., 2000; Helms et al., 2008; Lei et al., 2019). Generally, marine CDOM has chromophores with smaller molecular size and weight, while terrestrially dominated CDOM has larger molecular size and weight (Stedmon et al., 2000; Helms et al., 2008; Fichot and Benner, 2012; Zhao et al., 2021). Consequently, our results support these previous observations. Interestingly, both S_215-240 and S_215-275 are more strongly correlated with salinity (R² > 0.70, Figure 9A, B) than S_275-295T (R² = 0.34, Figure 9C) for marine CDOM, indicating that spectral slopes at shorter wavelengths might be better proxies for marine CDOM than those at longer wavelengths.

Figure 9 The correlation between salinities and spectral slopes (A)-S_215-240, (B)-S_215-275, (C)-S_275-295T.

4 Conclusions

We present, for the first time, a machine learning technique to model the UV absorption spectra of CDOM in seawater between 215 and 350 nm. Three machine learning models, BPNN, RF and XGBoost, were constructed based on the raw seawater UV absorption spectra and the results were compared with each other.

The optimal input wavelength range for the three models was 250-350 nm. With the optimal model parameters chosen by five-fold cross validation, all three models predicted the CDOM absorption spectra between 215 and 350 nm well. Comparatively, the XGBoost model had the best prediction performance, with the highest R² and lowest RMSE and MAE.

Spectral slopes of short wavelengths, S_215-240 and S_215-275, are higher than the widely used S_275-295T. More interestingly, S_215-240 and S_215-275 correlate better with salinity than S_275-295T for marine CDOM, suggesting that spectral slopes of short wavelengths might be more suitable for describing marine CDOM. We strongly advocate the inclusion of spectral slopes of short wavelengths in future CDOM studies.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

HW and AJ are the principal investigators and initiated the project. AJ and HW wrote the first draft of the manuscript. AJ and LW performed the spectral measurements. YW contributed to the data analysis and data processing. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by The National Key Research and Development Program of China (No. 2022YFC2805504) and National Natural Science Foundation of China (No. 42076062).

Acknowledgments

The authors would like to thank all the participants and crew of the cruises KECES-2020 for collecting samples.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdel-Rahman E. M., Mutanga O., Odindi J., Adam E., Odindo A., Ismail R. (2017). Estimating Swiss chard foliar macro-and micronutrient concentrations under different irrigation water sources using ground-based hyperspectral data and four partial least squares (PLS)-based (PLS1, PLS2, SPLS1 and SPLS2) regression algorithms. Comput. Electron. Agr. 132, 21–33. doi: 10.1016/j.compag.2016.11.008