AUTHOR=Tran Yen Binh , Arias-Rodriguez Leonardo F. , Huang Jingshui TITLE=Predicting high-frequency nutrient dynamics in the Danube River with surrogate models using sensors and Random Forest JOURNAL=Frontiers in Water VOLUME=4 YEAR=2022 URL=https://www.frontiersin.org/journals/water/articles/10.3389/frwa.2022.894548 DOI=10.3389/frwa.2022.894548 ISSN=2624-9375 ABSTRACT=

Nutrient dynamics play an essential role in aquatic ecosystems. Despite advances in sensor technology, nutrient concentrations remain difficult and expensive to monitor in situ and in real time. Emerging data-driven methods may provide surrogate measures for nutrient concentrations. In this work, we use 4 years of high-frequency (15-min interval) water quality data acquired at 2 automatic stations on the German Danube River to train data-driven algorithms and build surrogate measures for nitrate (NO₃⁻-N), ammonium (NH₄⁺-N), and orthophosphate (PO₄³⁻-P). Pre-processing of the data included removing outliers and filling missing values by linear interpolation. Multiple Linear Regression (MLR) and Random Forest (RF) models are trained, cross-validated, and tested using dissolved oxygen (DO), temperature (Temp), conductivity (EC), pH, discharge rate (Q), and chlorophyll-a (Chl-a) as input features. Additionally, we used the time-series data to develop cyclical features and tested whether they better capture the underlying relationships in the data. This work presents a thorough description of the modeling workflow, including intermediate steps for feature engineering, feature selection, and hyperparameter optimization. In total, 12 surrogate models (2 algorithms × 3 constituents × 2 stations) are compared using R² and RMSE as evaluation metrics. The results show that RF outperforms MLR for all surrogate models once at least three predictors are included. The MLR models give R² values of 0.67 and 0.89 for NO₃⁻-N, 0.39 and 0.40 for NH₄⁺-N, and 0.34 and 0.54 for PO₄³⁻-P at the Pfelling and Jochenstein stations, respectively. The RF models produce accurate predictions with low errors for all targets: NO₃⁻-N (R² = 0.99 and 0.99), NH₄⁺-N (R² = 0.98 and 0.99), and PO₄³⁻-P (R² = 0.96 and 0.96). The improvement in RMSE for RF compared to MLR in predicting nutrients ranges from 73 to 92%.
This work demonstrates the usefulness of surrogate models using the RF algorithm when reproducing nutrient dynamics and serving as soft sensors for monitoring nutrient concentrations.
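
As a rough illustration of three steps named in the abstract (gap filling by linear interpolation, cyclical time features, and the R²/RMSE comparison), the following standard-library Python sketch may help. It is not the authors' code. The abstract does not say how the cyclical features were built or how the percentage improvement was computed, so the sine/cosine encoding of hour-of-day and day-of-year, the relative-RMSE formula, and all function names here are assumptions chosen for illustration.

```python
import math
from datetime import datetime


def interpolate_gaps(series):
    """Fill None gaps linearly between the nearest valid neighbours.

    Leading and trailing gaps have no neighbour on one side and stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def cyclical_encode(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def time_features(ts):
    """Assumed feature set: hour-of-day and day-of-year, each as (sin, cos)."""
    hour = ts.hour + ts.minute / 60.0
    day_of_year = ts.timetuple().tm_yday
    return cyclical_encode(hour, 24.0) + cyclical_encode(day_of_year, 365.25)


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse_improvement(rmse_baseline, rmse_model):
    """Assumed definition: relative RMSE reduction versus a baseline, in percent."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


if __name__ == "__main__":
    # Made-up values for demonstration only, not data from the study.
    print(interpolate_gaps([1.0, None, None, 4.0]))
    print(time_features(datetime(2020, 6, 15, 23, 45)))
    print(rmse_improvement(rmse_baseline=2.0, rmse_model=0.5))
```

The point of the sine/cosine pair is that timestamps on either side of a period boundary (for example 23:45 and 00:00) end up close together in feature space, which a raw hour value from 0 to 23 would not achieve.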