AUTHOR=Stocker Matthew D. , Pachepsky Yakov A. , Hill Robert L. TITLE=Prediction of E. coli Concentrations in Agricultural Pond Waters: Application and Comparison of Machine Learning Algorithms JOURNAL=Frontiers in Artificial Intelligence VOLUME=Volume 4 - 2021 YEAR=2022 URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2021.768650 DOI=10.3389/frai.2021.768650 ISSN=2624-8212 ABSTRACT=The microbial quality of irrigation water is an important issue as contaminated waters have been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers have turned to estimating concentrations of the microbial contamination indicator E. coli from the concentrations of chemical water quality parameters. However, these relationships were mainly non-linear and exhibited changes above or below certain thresholds. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during irrigation season. E. coli concentrations along with 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting machines, random forest, support vector machines, and k-nearest neighbor algorithms. The random forest model provided the lowest RMSE value for predicted E. coli concentrations in both ponds in individual years and over consecutive years in almost all cases. The RMSE values for the random forest model using the 3-year dataset were 0.334 and 0.381 for Pond 1 and Pond 2, respectively. For individual years, this value ranged from 0.244 to 0.346 and 0.304 to 0.418 at Pond 1 and Pond 2, respectively, all in log10 E. coli concentrations. 
In most cases, there was no significant difference (P > 0.05) between the RMSE of the random forest model and those of the other ML models when the RMSE values were treated as statistics derived from 10-fold cross-validation performed with five repeats. Important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not significantly differ when 5 predictors were used versus 8 or 12, indicating that more tedious and costly measurements bring no substantial improvement in ML predictive accuracy.
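
The evaluation scheme named in the abstract, 10-fold cross-validation with five repeats scored by RMSE, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code or data: the helper names (`rmse`, `knn_predict`, `repeated_kfold_rmse`) are hypothetical, the inputs are synthetic placeholders standing in for five predictors and log10 E. coli concentrations, and a plain unweighted k-nearest neighbor regressor (one of the four algorithm families listed) is used only because it fits in a few lines of standard-library Python.

```python
import math
import random
import statistics


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def repeated_kfold_rmse(x, y, n_splits=10, n_repeats=5, k=5, seed=0):
    """Return one RMSE per held-out fold: n_splits * n_repeats values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        order = list(range(len(x)))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            train_x = [x[i] for i in train]
            train_y = [y[i] for i in train]
            predictions = [knn_predict(train_x, train_y, x[i], k) for i in fold]
            scores.append(rmse([y[i] for i in fold], predictions))
    return scores


# Synthetic placeholder data (NOT the pond measurements): 200 samples,
# five predictors scaled to [0, 1], and a noisy target that plays the
# role of log10 E. coli concentration.
data_rng = random.Random(42)
x = [[data_rng.random() for _ in range(5)] for _ in range(200)]
y = [1.0 + 1.5 * row[0] - 0.8 * row[1] + data_rng.gauss(0, 0.3) for row in x]

scores = repeated_kfold_rmse(x, y)
print(f"{len(scores)} fold RMSEs, mean = {statistics.mean(scores):.3f}, "
      f"sd = {statistics.stdev(scores):.3f}")
```

Treating the 50 fold-level RMSE values as a sample is what makes a significance test between two models possible: running the same splits for a second model yields a second set of 50 values to compare against the first. With real measurements the predictors would normally be standardized before computing distances, since k-nearest neighbor is sensitive to differences in scale.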