AUTHOR=Zhang Steve , Ahamed Firnaaz , Song Hyun-Seob TITLE=Knowledge-informed data-driven modeling for sparse identification of governing equations for microbial inactivation processes in food JOURNAL=Frontiers in Food Science and Technology VOLUME=2 YEAR=2022 URL=https://www.frontiersin.org/journals/food-science-and-technology/articles/10.3389/frfst.2022.996399 DOI=10.3389/frfst.2022.996399 ISSN=2674-1121 ABSTRACT=

Prevention of the growth of harmful microorganisms in food products is an important requirement for ensuring food safety and quality. Mathematical models to predict the quantitative changes in microbial populations in food to the variations of environmental conditions are useful tools in this regard. While equations for microbial inactivation have typically been formulated based on polynomial functions, empirical choice of the model order and terms not only results in over- or underfitting, but also makes it difficult to identify key factors governing the target variable. To address this issue, we present a data-driven modeling pipeline that enables 1) automatic discovery of model equations through parsimonious selection of relevant terms from a pre-built library and 2) subsequent evaluation of the impacts of individual terms on the model output. Through case studies using literature data, we evaluated the effectiveness of our pipeline in predicting the D-value (i.e., the time taken to reduce microbial population to 10% of the initial level) as a function of multiple factors including temperature, pH, water activity, NaCl content, and phosphate level. In doing this, we determined basic functional forms of input and output variables based on their pre-known relationships, e.g., by accounting for the Arrhenius dependence of D-value on temperature. Incorporation of such theoretical knowledge into the pipeline improved model accuracy. Using the Akaike information criterion, we optimally determined hyperparameters that control a trade-off between model accuracy and sparsity. We found the literature models benchmarked in this study to be over- or under-determined and consequently proposed better structured and more accurate equations. The subsequent global sensitivity analysis allowed us to evaluate the context-dependent impacts of key factors on the D-value. The pipeline presented in this work is readily applicable to many other related non-linear systems without being limited to microbial inactivation datasets.