Predicting Escherichia coli levels in manure using machine learning in weeping wall and mechanical liquid solid separation systems

Shetty, B. Dharmaveer; Amaly, Noha; Weimer, Bart C.; Pandey, Pramod

doi:10.3389/frai.2022.921924

ORIGINAL RESEARCH article

Front. Artif. Intell., 04 January 2023

Sec. AI in Food, Agriculture and Water

Volume 5 - 2022 | https://doi.org/10.3389/frai.2022.921924

B. Dharmaveer Shetty¹

Noha Amaly¹,²

Bart C. Weimer¹

Pramod Pandey¹*

¹Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
²Polymeric Materials Research Department, Advanced Technology and New Materials Research Institute, City of Scientific Research and Technological Applications (SRTA-City), New Borg El-Arab City, Alexandria, Egypt

An increased understanding of the interaction between manure management and public and environmental health has led to the development of Alternative Dairy Effluent Management Strategies (ADEMS). The efficiency of such ADEMS can be increased using mechanical solid-liquid-separator (SLS) or gravitational Weeping-Wall (WW) solid separation systems. In this research, using pilot study data from 96 samples, the chemical, physical, biological, seasonal, and structural parameters between SLS and WW of ADEM systems were compared. Parameters including sodium, potassium, total salts, volatile solids, pH, and E. coli levels were significantly different between the SLS and WW of ADEMS. The separated solid fraction of the dairy effluents had the lowest E. coli levels, which could have beneficial downstream implications in terms of microbial pollution control. To predict effluent quality and microbial pollution risk, we used Escherichia coli as the indicator organism, and a versatile machine learning, ensemble, stacked, super-learner model called E-C-MAN (Escherichia coli–Manure) was developed. Using pilot data, the E-C-MAN model was trained, and the trained model was validated with the test dataset. These results demonstrate that the heuristic E-C-MAN ensemble model can provide a pilot framework toward predicting Escherichia coli levels in manure treated by SLS or WW systems.

Introduction

In confined agroecosystems such as intensive dairy farms, significant quantities of animal wastes are generated that need to be managed efficiently (Van Horn et al., 1994; Meyer et al., 1997; Neufeld et al., 2017). For example, about half a million cattle can produce in excess of 27,000 tons of animal waste per day within an area of < 26,000 square kilometers (Popova and Morra, 2017). On one hand, these large quantities of animal waste and farm effluents are a good source of nutrients; on the other hand, the effluents also pose risks to water and the environment. With increased interest in applying treated manure to grow forage crops as an alternative to chemical fertilizers, approaches that improve the recycling and use of dairy manure bring both economic and environmental benefits (Neufeld et al., 2017; Popova and Morra, 2017). Handling large quantities of dairy effluents poses multiple challenges, as they can contribute to water contamination, air pollution, greenhouse gas emissions, and the spread of antimicrobial-resistant and pathogenic bacteria, with human, animal, and One Health implications (Owen and Silver, 2015; Sharma et al., 2017; Vadas et al., 2017; Niles and Wiltshire, 2019). Nevertheless, there are multiple opportunities and benefits in using manure as fertilizer.

The interaction of manure management practices with environmental, human, and animal health has gained increased attention, and such practices must be constantly assessed to develop research and educational programs that implement alternative effluent management techniques (Meyer et al., 1997). These techniques include manure digesters, settling ponds, evaporation ponds, and solid-liquid separators (Meyer et al., 2004; Liu et al., 2017). Most of these techniques, often implemented in combination, aim to reduce greenhouse gas emissions, eutrophication stress, odor, nutrients, and microbial communities, and provide options to recycle dairy effluents for beneficial purposes (Zhang and Westerman, 1997; Wang et al., 2020).

Separating solid manure from flushed manure effluents on a dairy farm can increase the efficiency of the manure management process by reducing the excessive total solids, volatile solids, and microbial communities in flushed manure. This can help reduce the loads in downstream lagoons and greenhouse gas emissions, and improve pumping efficiency and the use of manure for irrigation (Mukhtar et al., 2011; Neuhaus, 2020; Ellison and Horwath, 2021). A solid-liquid separation system produces two streams: (1) a solid manure stream, which is composted and dried, and used for fertilizing crops and as bedding material for dairy farms; and (2) a liquid manure stream, often used for irrigating cropland with manure enriched with nitrogen and phosphorus (Mukhtar et al., 2011; Vanotti et al., 2020; Wu and Zhong, 2020). Common approaches for separating the manure solids from the liquid stream include sedimentation or gravitational settling and mechanical screening of solids (Mukhtar et al., 2011).

A mechanical solid-liquid separator with inclined screens and conveyor scraper separators is called the SLS (Solid Liquid Separator) System. A gravity separation system with a settling basin and a large dewatering surface area is called a WW (Weeping Wall) System. These are two commonly used solid separation techniques in the dairy farms of California, USA (Meyer et al., 2004; Mukhtar et al., 2011). In both the SLS and the WW effluent management systems, the dairy waste goes through multiple stages (Figure 1).

FIGURE 1

Figure 1. Conceptual flow of manure in the two comparative Dairy Effluent Management Systems found on two partner farms from Central California, USA. The Dairy Effluent Management System in part (A) includes the Mechanical Solid Liquid Separator (SLS) System and the Effluent Management System in part (B) includes the gravitational Weeping Wall (WW) System.

These stages are: Stage 1 (S1), the Flushed Effluent Stage, where effluents are collected after flushing the barns and prior to separating the solid fraction; Stage 2 (S2), the Separated Liquid Stage, where the liquid fraction is separated from the solid fraction and stored in lagoons; Stage 3 (S3), the Separated Solid Stage, where the solid fraction separated from the liquid is dried and composted; and Stage 4 (S4), where liquid manure, after solids have settled in the lagoons, is recycled to flush manure from the dairy barns.

Previous studies have demonstrated that the SLS can reduce the Total Solid (TS) and Volatile Solid (VS) components from the effluent slurry by 60.9 and 62.8%, respectively (Chastain et al., 2001). Further, in combination with a two-chambered settling basin and a lagoon arranged in series, the SLS System was shown to reduce the TS by up to 93% and the VS by up to 95.6% from the dairy effluents (Chastain et al., 2001). A single-stage WW System was shown to reduce TS by 49–63% (Meyer et al., 2004), whereas a two-stage WW System reduced TS by 35% and VS by 40% (Mukhtar et al., 2011). The mean electrical conductivity, potassium, calcium, sodium, and chlorine soluble nutrient concentrations in a single-stage WW remained unchanged between the incoming and outgoing stages of the WW System (Meyer et al., 2004).

Compared to the physical and chemical characteristics, there is limited research investigating the effect of the SLS and WW Systems toward reducing the risk of pathogenic bacteria in the dairy agroecosystems, though other forms of solid separation systems have shown variable but occasionally promising results (Boutilier et al., 2009; Liu et al., 2017; Wang et al., 2019). For example, one study demonstrated that dairy manure management practices that separate solid from liquid waste effectively reduced E. coli concentrations (Howard et al., 2017). Another non-comparative study on the WW ADEM demonstrated that it reduced Cryptosporidium oocysts, and the authors explained that this could be due to the separation of the oocysts in the liquid component of the manure (Hutchison et al., 2005). Such studies in related systems provided clues and directions for our study and its potential results.

In spite of the limited comparative studies in these specific ADEMS, it is well documented that implementing treatment methods that effectively reduce potentially pathogenic bacteria in manure is important, because many outbreaks of gastroenteritis in humans and animals have been related to livestock operations (Liu et al., 2017). Fecal coliforms, including E. coli, are common indicator organisms used to assess water quality and determine the presence of pathogenic microorganisms that may cause illness and disease in exposed human and animal populations (Boutilier et al., 2009). Previous studies showed that animal waste is one of the major sources of indicator organisms such as E. coli in the environment (Malakoff, 2002; Pandey and Soupir, 2012a). The concentrations or loads of fecal coliforms and E. coli in ambient water are often used to determine microbial pollution in recreational and drinking water, and these strategies help protect public health and identify water bodies with potential pollution. Various government agencies, including the U.S. Environmental Protection Agency (U.S. EPA), monitor E. coli levels in rivers and lakes to determine microbial water quality (Pandey et al., 2012b; Pandey and Soupir, 2013). Though there is much debate regarding the ability of indicator organisms to determine the human health risks caused by pathogenic bacteria, indicator organisms are widely used to determine the microbial quality of water and food. Although it is often challenging to determine the source of pathogens in the environment (e.g., animal waste or wildlife excreta), animal waste is considered a leading source of microbial pollution in the environment, including ambient water (Malakoff, 2002; Dickerson et al., 2007; Pandey et al., 2014). In addition to indicator organisms, microbial source tracking, which identifies the origin of fecal coliforms, is also used to reduce public health risks (Scott et al., 2002; Grave et al., 2007; Ibekwe et al., 2011; Ma et al., 2014).

To advance our existing understanding in manure management, in this study, we attempted to bridge the prevailing knowledge gaps regarding the SLS and WW Systems. In this pilot scale study, we attempted to: (1) compare the similarities and differences in chemical, physical, and biological parameters across Dairy Effluent Management Systems containing the two solid separation systems (SLS and WW); and (2) create a versatile machine learning model using these parameters to predict the E. coli risk across the SLS and WW Systems in the dairy effluent systems from two farms in California, USA.

Materials and methods

Sample collection

In this pilot-scale study, manure characteristic data were collected from two representative dairy effluent management systems on two partner dairy farms in Central California, USA. One of the effluent management systems included the mechanical SLS solid separation system, while the other included the gravitational WW solid separation system. Within each effluent management system, samples were collected from the four stages, S1–S4, described in the introduction (Figure 1). Each sample was collected from the top layer of the respective stage and placed in an independent tube. Subsequently, the samples were maintained on ice during transportation and stored in a cold room (4°C) upon arrival at the laboratory (Li et al., 2014) until they were processed for the various parameters. A total of 96 manure samples were collected using a balanced sampling design for this pilot study. Forty-eight samples were collected from the Dairy Effluent Management System containing the mechanical SLS system, while another 48 samples were collected from the Management System containing the gravitational WW system. Across the two systems, a total of 24 samples were collected from each of the four stages, S1–S4. These samples were collected across two seasons. A total of 32 samples were collected in Spring, between March and May of 2019, whereas 64 samples were collected in Summer, during the month of June 2019.

Chemical, physical and biological parameters

A set of chemical, physical, and biological parameters was measured for each of the collected samples.

Escherichia coli

Liquid samples (from Stages S1, S2, and S4) were collected and homogenized. Subsequently, 1 ml of each sample was aliquoted and diluted with 9 ml of Millipore-filtered water. The solid samples (from Stage S3) were prepared by diluting 5 g of the solid effluent in 10 ml of Millipore-filtered water. The diluted samples were homogenized by vortexing for 2 min. These samples were serially diluted (× 10, × 10⁻¹, × 10⁻², × 10⁻³, × 10⁻⁴) using Millipore-filtered water, and 1 ml of each serial dilution was processed through a membrane filtration technique following EPA Method 1603 (EPA, 2002). A VP-300 Vacuum Pump was used to create a vacuum and a flow rate of 10 ml/cm²/min. An MCE White Membrane/Black Grid Membrane with 0.22 μm pore size and 28 cm² area (Sigma-Aldrich, St. Louis, Missouri, USA) was used for filtering the study samples, and subsequently, the filter paper was placed on prepared modified mTEC Agar (Difco, Sparks, MD, USA) medium in a Petri dish. The Petri dish with the filter paper was placed in an incubator at 37°C for 24 h. Since this medium is selective, it can distinguish E. coli from other microorganisms. The presence of thermotolerant Escherichia coli was demonstrated by a chromogenic reaction, i.e., pink colonies, which were enumerated as Colony Forming Units per milliliter (CFU/ml). All the samples were filtered after fresh dilution. Each sample was analyzed in duplicate, and the experiment was repeated three times.
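
The enumeration step above implies a back-calculation from colony counts to CFU/ml of the original sample. The following Python sketch illustrates one common way to do this; it is not the authors' code. The function names, the example colony counts, and the convention that `dilution_factor` is the fraction of original sample in the plated dilution are all assumptions made for illustration.

```python
def cfu_per_ml(colony_count, dilution_factor, volume_filtered_ml=1.0):
    """Back-calculate CFU/ml of the original sample from a plate count.

    dilution_factor is the fraction of original sample present in the
    filtered dilution (e.g. 1e-3 for a 1:1000 dilution). This convention
    is an assumption for this sketch.
    """
    if dilution_factor <= 0 or volume_filtered_ml <= 0:
        raise ValueError("dilution_factor and volume_filtered_ml must be positive")
    return colony_count / (dilution_factor * volume_filtered_ml)


def mean_cfu_per_ml(duplicate_counts, dilution_factor, volume_filtered_ml=1.0):
    """Average duplicate plate counts, then back-calculate CFU/ml."""
    mean_count = sum(duplicate_counts) / len(duplicate_counts)
    return cfu_per_ml(mean_count, dilution_factor, volume_filtered_ml)


# Hypothetical duplicate plates with 42 and 38 pink colonies at a 10^-3 dilution,
# with 1 ml filtered per plate (the filtered volume reported in the text).
example_cfu = mean_cfu_per_ml([42, 38], dilution_factor=1e-3)
print(f"{example_cfu:.0f} CFU/ml")
```

Averaging the duplicates before scaling keeps the calculation on the raw counts, which is where plating variability arises.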

Total solids and volatile solids

Total Solids (TS) and Volatile Solids (VS) were measured using the Ignition Method (Clesceri et al., 1998). Initially, the empty dry crucible weight (W₁) was measured, and subsequently, a sample of ~10 ml or 10 g was placed in the crucible. Then, the combined weight of the wet sample and crucible was measured (W₂). The sample was heated at 104°C for 16 h, and then the weight of the dried residue and crucible was measured (W₃). Subsequently, the TS were calculated using Eq. (1). For measuring VS, the samples that had been dried at 104°C were placed in a furnace for 4 h at 500°C. The combined sample and crucible weight was measured after ignition (W₄). Subsequently, the VS were calculated using Eq. (2).

TS = \frac{W_{3} - W_{1}}{W_{2} - W_{1}} \qquad (1)

VS = \frac{W_{3} - W_{4}}{W_{2} - W_{1}} \qquad (2)
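
As a worked illustration of Eqs. (1) and (2), the Python sketch below computes TS and VS as fractions of the wet sample mass. The crucible weights are hypothetical values chosen only to make the arithmetic easy to follow; they are not measurements from this study.

```python
def total_solids(w1, w2, w3):
    """Eq. (1): dried residue mass divided by wet sample mass."""
    wet_mass = w2 - w1
    if wet_mass <= 0:
        raise ValueError("wet sample mass (W2 - W1) must be positive")
    return (w3 - w1) / wet_mass


def volatile_solids(w1, w2, w3, w4):
    """Eq. (2): mass lost on ignition divided by wet sample mass."""
    wet_mass = w2 - w1
    if wet_mass <= 0:
        raise ValueError("wet sample mass (W2 - W1) must be positive")
    return (w3 - w4) / wet_mass


# Hypothetical weights in grams: empty crucible, crucible + wet sample,
# crucible + dried residue (104 C), crucible + ash (500 C).
w1, w2, w3, w4 = 25.0, 35.0, 25.8, 25.2

ts = total_solids(w1, w2, w3)         # 0.8 g dried residue / 10 g wet sample
vs = volatile_solids(w1, w2, w3, w4)  # 0.6 g lost on ignition / 10 g wet sample
print(f"TS = {ts:.1%}, VS = {vs:.1%}")
```

Both quantities share the denominator W₂ − W₁, so VS can never exceed TS as long as W₄ ≥ W₁.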

Chemical and physical characteristics

The nitrate, chloride, potassium, calcium, sodium, and total salt content, along with electrical conductivity and pH, were measured for each sample. For liquid samples (collected from Stages S1, S2, and S4), 5 ml of each sample was transferred to an Eppendorf tube and centrifuged at 5,000 rpm (~2,432 g) for 5 min to separate any suspended solid content. For solid samples (collected from Stage S3), 1 g of each sample was homogeneously dispersed in 5 ml of water for 1 h and then separated by centrifugation at 5,000 rpm (~2,432 g) for 5 min. Subsequently, 500 μL of supernatant was taken and placed on sensors for measuring ions. Sensors for measuring nitrate, potassium, calcium, and salt ions, as well as electrical conductivity and pH, were obtained from Horiba (Horiba Limited, Japan). Each sensor was calibrated prior to measurement using known standards of nitrate, potassium, calcium, total salts, conductivity solution, and pH. All the measurements were conducted at room temperature and repeated three times for each sample.

Comparing dairy effluent characteristics across systems, stages, and seasons

Analysis of Variance (ANOVA) tests were used to determine statistically significant differences in the continuous chemical, physical, and biological parameters between the various systems, stages, and seasons. The tests were run using the “arsenal” package in “R” statistical software version 4.0.3, and a parameter was considered statistically significant if the p-value was ≤ 0.05. Subsequently, in order to visualize the correlations between the various chemical, physical, and biological parameters, a correlogram (i.e., correlation matrix) was constructed using the “corrplot” package in the “R” statistical software version 4.0.3 (Wei et al., 2017). The correlogram was visualized as a color-coded heat map matrix.
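
The numbers behind a correlogram are pairwise correlation coefficients. The sketch below computes a Pearson correlation matrix in plain Python. It assumes Pearson correlation (the text does not name the coefficient), it is not the “corrplot” workflow used in the study, and the three short columns are hypothetical values, not study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical parameter columns (illustrative only).
data = {
    "TS": [1.0, 2.0, 3.0, 4.0],
    "VS": [0.9, 2.1, 2.9, 4.2],
    "E_coli": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A plotting package then maps each coefficient to a color and marker size; the matrix itself is symmetric with ones on the diagonal.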

Machine learning models

Using the various chemical, physical, biological, structural, and seasonal parameters, an assortment of machine learning models was built and compared to predict E. coli levels in dairy effluents, using various packages through the “caret” library in “R” (Kuhn, 2008). The dataset was (a) preprocessed to make it machine-readable, (b) split into training and test datasets, and subsequently, (c) used to build and compare various machine learning base-learner and super-learner algorithms.

Creating a machine-readable dataset

Before building the models, the dataset was described and processed to make it machine-readable. Initially, the relationship between each of the chemical, physical, and biological parameters and the outcome variable, i.e., E. coli levels (CFU/ml), was visualized. The continuous predictor variables were plotted using smoothed scatter plots, whereas the categorical predictor variables were plotted using box-and-whisker plots. Subsequently, the dataset was analyzed for zero and near-zero variance predictors, linear dependencies amongst predictors, and highly correlated predictors. Predictors that have a single unique value, i.e., zero variance predictors, or predictors that have a handful of unique values that occur with very low frequencies, i.e., near-zero variance predictors, could cause some machine learning models to crash or create an undue bias. Linear dependencies between predictor variables should be identified to remove redundant variables from the dataset.

Highly correlated predictors could either improve or decrease the performance of select machine learning algorithms, and thus, it is important to identify such predictors prior to building comparative models. In addition to individually scanning for zero variance predictors, linear dependencies, and highly correlated predictors, the model variables were also subjected to select normalizing transformations using the bestNormalize R package in order to improve the efficiency of the regression machine learning algorithms (Peterson and Peterson, 2020). After transformation, the success of the normalizing transformation was assessed in several ways, including visual plots, the Shapiro-Wilk test, and the skewness value.
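
Two of the screening checks described above can be sketched in a few lines of Python. The first flags zero and near-zero variance predictors from the frequency ratio of the two most common values and the percentage of unique values. The cutoffs (95/5 and 10%) follow what we understand to be the defaults of caret's `nearZeroVar` and should be treated as an assumption. The second computes the moment-based sample skewness. This is an illustrative reimplementation, not the R code used in the study.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag zero and near-zero variance predictors.

    freq_cut: cutoff for (count of most common value) / (count of second most common).
    unique_cut: cutoff for the percentage of distinct values among all values.
    Both defaults are assumptions for this sketch.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return {"zero_var": True, "near_zero_var": True}
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(values)
    flagged = freq_ratio > freq_cut and percent_unique < unique_cut
    return {"zero_var": False, "near_zero_var": flagged}


def skewness(values):
    """Moment-based sample skewness: m3 / m2^1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for a constant column")
    return m3 / m2 ** 1.5


# Hypothetical predictor columns with 96 values each (illustrative only).
constant_column = [0] * 96
rare_value_column = [0] * 95 + [1]
varied_column = list(range(96))

for name, col in [("constant", constant_column),
                  ("rare value", rare_value_column),
                  ("varied", varied_column)]:
    print(name, near_zero_variance(col))
print("skewness of a right-tailed sample:", round(skewness([1, 1, 1, 10]), 3))
```

A skewness close to zero after transformation is one indication, alongside plots and a normality test, that the transformation has made the variable more symmetric.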

Splitting the dataset into training and test datasets

The machine-readable dataset was split into a training and a test dataset. The probability-based createDataPartition function in the R package caret (Kuhn, 2008) was used to obtain an 80:20 balanced split of the dataset. Using the training dataset, a Missing Data model, a Dummy Variable model, and a Transformation model were developed. The Missing Data model, which used the K-nearest neighbor (knn) method, was used to impute missing predictor values and, simultaneously, to center and scale all the predictor values. The Dummy Variable model was developed to create one-hot encoded variables for the categorical predictors, and the Transformation model was used to transform all the predictor values into a range from 0 to 1. Subsequently, using these three models, missing data were imputed, dummy variables were created for categorical predictors, and predictor values were transformed for both the training dataset and the test dataset.
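
The key design point in this step is that the preprocessing models are fitted on the training data only and then applied unchanged to the test data. The Python sketch below illustrates that pattern with a simplified stratified 80:20 split and a range (0 to 1) transformation. It stratifies on a categorical column, which is a simplification and not a reimplementation of `createDataPartition`, and the rows are hypothetical.

```python
import random


def stratified_split(rows, key, train_fraction=0.8, seed=42):
    """Split rows into train/test, sampling within each level of `key`."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def fit_min_max(train_rows, column):
    """Learn the range of a column from the training rows only."""
    values = [row[column] for row in train_rows]
    return min(values), max(values)


def apply_min_max(rows, column, low, high):
    """Rescale a column with a previously fitted (low, high) range."""
    span = high - low
    return [(row[column] - low) / span if span else 0.0 for row in rows]


# Hypothetical dataset: 48 rows per system with an illustrative pH column.
rows = [{"system": "SLS" if i < 48 else "WW", "pH": 6.5 + (i % 12) * 0.1}
        for i in range(96)]

train_rows, test_rows = stratified_split(rows, key="system")
low, high = fit_min_max(train_rows, "pH")          # fitted on training data only
train_scaled = apply_min_max(train_rows, "pH", low, high)
test_scaled = apply_min_max(test_rows, "pH", low, high)
print(len(train_rows), len(test_rows))
```

Because the range comes from the training rows, scaled test values can in principle fall slightly outside 0 to 1; that is expected and avoids leaking test information into the preprocessing.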

Machine learning algorithms

More than 30 machine learning models, including 28 base-learner and independent regression models from seven different families of algorithms (Supplementary Table 1), and multiple super-learner ensemble stacked models were constructed and compared. The 28 independent models were drawn from the following families: generalized linear models, random forest models, boosting models, support vector machine (SVM) models, multivariate adaptive regression splines (MARS), neural networks, and partial least squares models. Each of these models was trained with the training dataset. In order to optimize each of these models, the individual models were cross-validated, and the constituent model hyperparameters were tuned using a repeated k-fold cross-validation (repeated CV) method with 10 folds and 20 repeats.
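
Repeated k-fold cross-validation with 10 folds and 20 repeats produces 200 train/hold-out resamples per candidate hyperparameter setting. The Python sketch below generates such index sets to make the resampling scheme concrete; the training-set size of 76 and the seed are hypothetical, and this is not the caret implementation.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=20, seed=1):
    """Yield (repeat, fold, train_indices, held_out_indices) tuples."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k, held_out in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
            yield repeat, k, train, held_out


# Hypothetical training-set size; each sample is held out once per repeat.
splits = list(repeated_kfold_indices(76))
print(len(splits), "resamples")
```

Each hyperparameter candidate is scored on every hold-out set, and the scores are averaged; repeating the partition reduces the dependence of that average on any single random split.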

Super-learner ensemble models

Multiple stacked super-learner ensemble models were also constructed by using combinations of the base-learner regression models in multiple frameworks, such as a Generalized Linear Model (GLM) framework. The GLM ensembles were built using the Root Mean Square Error (RMSE) metric, and cross-validation was performed by tuning the parameters using a repeated k-fold cross-validation method with 10 folds and 20 repeats. Amongst multiple combinations, the ensemble trials included GLM ensembles of all the 28 base-learner models, random forest (RF) ensembles of all the 28 base-learner models, subsets of the base-learner models, and the seven best-fit models from the seven different families of machine learning models used in the present study.
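
The idea behind a GLM-stacked ensemble can be sketched as follows: the held-out predictions of the base learners become the inputs of a linear meta-model whose coefficients weight each base learner. The Python example below fits such a meta-model by ordinary least squares. It is a minimal sketch with two hypothetical base learners and made-up predictions, not the E-C-MAN model or the R tooling used in the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: base-learner predictions are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_meta_model(base_predictions, observed):
    """Least-squares fit of observed ~ intercept + weighted base predictions."""
    design = [[1.0] + list(row) for row in base_predictions]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, observed)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_meta_model(coefficients, row):
    """Combine one row of base-learner predictions into an ensemble prediction."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))


# Hypothetical held-out predictions from two base learners (one row per sample)
# and hypothetical observed outcomes; here the outcome is their exact average.
held_out_predictions = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
observed = [1.5, 1.5, 3.5, 3.5, 5.5]

coefficients = fit_meta_model(held_out_predictions, observed)
print([round(c, 3) for c in coefficients])
print(round(predict_meta_model(coefficients, (2.0, 3.0)), 3))
```

Using held-out rather than in-sample base predictions matters: a meta-model trained on in-sample predictions would reward whichever base learner overfits most.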

Choosing the model with the highest predictive capability

The Root Mean Square Error (RMSE) metric, the most commonly used method for evaluating and comparing a model's predictive capabilities (Kuhn and Johnson, 2013), was used to select the relatively best model. This metric is a function of the model residuals: it is the square root of the Mean Square Error (Kuhn and Johnson, 2013). The predictive capability of a model is negatively correlated with its RMSE value, i.e., the lower the RMSE value, the higher the predictive capability of the model. For the final model that was chosen, the values were reverse-transformed back to the original scale.
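
A short Python sketch of the selection rule described above: compute the RMSE of each candidate on the same observations and keep the candidate with the lowest value. The model names and numbers are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared residual."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical observations and predictions from two candidate models.
observed = [2.0, 4.0, 6.0]
candidates = {
    "model_a": [2.0, 4.0, 6.5],
    "model_b": [2.1, 3.9, 6.1],
}

scores = {name: rmse(observed, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)  # lowest RMSE is preferred
print(best, {name: round(score, 3) for name, score in scores.items()})
```

Because the residuals are squared before averaging, a single large miss (as in `model_a`) is penalized more than several small ones.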

Relative importance of the individual predictors in the machine learning model

Subsequently, the relative importance of individual predictors was calculated for the best-fit independent base-learner model using the LOESS (Locally Estimated Scatterplot Smoothing) R-squared variable importance method, where the R² statistic is calculated against the intercept-only null model. In order to evaluate the robustness of the results, the top two predictors were removed from the data, and the model was rerun.
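
For reference, an R² statistic measured against an intercept-only null model is conventionally written as below. Here y_i are the observed outcomes, \hat{y}_i are the fitted values from the LOESS smooth of the outcome on a single predictor, and \bar{y} is the mean outcome (the prediction of the null model). This is the standard definition, stated under the assumption that it matches the implementation used.

R^{2} = 1 - \frac{\sum_{i} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}}

A predictor whose smooth explains no more variation than the mean yields R² near 0, while a predictor that tracks the outcome closely yields R² near 1.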

Results and discussion

Chemical, physical, and biological parameters

Chemical, physical, and biological parameters were calculated for each of these 96 samples (Table 1). The Analysis of Variance (ANOVA) tests demonstrated that potassium, pH, E. coli, sodium, total salts, and volatile solid levels, in descending order of significance, were statistically different (p ≤ 0.05) between the Management Systems containing the SLS and the WW Systems. Calcium, nitrates, sodium, potassium, total solids, volatile solids, and E. coli levels were significantly different in at least one of the four different stages, S1, S2, S3, or S4. Similarly, calcium, nitrates, sodium, potassium, total solids, volatile solids, E. coli, and total salt levels were also significantly different in at least one of the eight different combinations of systems (n = 2) and stages (n = 4). Only E. coli, sodium, potassium, and volatile solid levels were independently and significantly different between the two systems, across the four stages, and across the eight system-stage combinations. None of the variables differed significantly between the Spring and Summer seasons.

TABLE 1

Table 1. Descriptive effluent characteristics in dairy effluent management systems.

Descriptive correlogram

The correlation matrix (Figure 2) visualizes the strength and direction of correlation between the various chemical, physical, and biological parameters analyzed from the collected dairy effluent samples. The E. coli levels were positively correlated with potassium (+0.39) and nitrates (+0.26), and negatively correlated with total solids (−0.30), volatile solids (−0.30), and pH (−0.22). The sodium levels were positively correlated with nitrates (+0.44), calcium (+0.33), and total salts (+0.3), and negatively correlated with total solids (−0.3) and volatile solids (−0.3). The total solids were positively correlated with volatile solids (+0.64), and negatively correlated with nitrates (−0.4), calcium (−0.36), potassium (−0.33), sodium (−0.33), and E. coli levels (−0.30). The volatile solids were positively correlated with total solids (+0.64), and negatively correlated with calcium (−0.47), potassium (−0.43), nitrates (−0.42), sodium (−0.38), and E. coli (−0.3). Other variables which displayed positive correlations are nitrates and sodium (+0.44), total salts and calcium (+0.4), and nitrates and potassium (+0.39).

FIGURE 2

Figure 2. Correlogram showing the correlation matrix between the various dairy effluent parameters. The positive correlations and negative correlations are colored blue and red, respectively. The intensity and size of the circle are proportional to the correlation coefficients. Abbreviations used: TS, Total Solids; VS, Volatile Solids; K, Potassium; NO3, Nitrates; Na, Sodium; Ca, Calcium; Salt, Total Salts; EC, Electrical Conductivity. The image was generated using the “corrplot” package in R version 4.0.3.