Artificial intelligence and machine learning applications for cultured meat

Todhunter, Michael E.; Jubair, Sheikh; Verma, Ruchika; Saqe, Rikard; Shen, Kevin; Duffy, Breanna

doi:10.3389/frai.2024.1424012

REVIEW article

Front. Artif. Intell., 24 September 2024

Sec. AI in Food, Agriculture and Water

Volume 7 - 2024 | https://doi.org/10.3389/frai.2024.1424012

Artificial intelligence and machine learning applications for cultured meat

Michael E. Todhunter¹^†^‡

Sheikh Jubair²^†^‡

Ruchika Verma²

Rikard Saqe³

Kevin Shen⁴

Breanna Duffy⁵^*^†

¹Todhunter Scientifics, Minneapolis, MN, United States
²Alberta Machine Intelligence Institute, Edmonton, AB, Canada
³Department of Mathematics, University of Waterloo, Waterloo, ON, Canada
⁴Department of Biology, University of Waterloo, Waterloo, ON, Canada
⁵New Harvest, Sacramento, CA, United States

Cultured meat has the potential to provide a complementary meat industry with reduced environmental, ethical, and health impacts. However, major technological challenges remain which require time-and resource-intensive research and development efforts. Machine learning has the potential to accelerate cultured meat technology by streamlining experiments, predicting optimal results, and reducing experimentation time and resources. However, the use of machine learning in cultured meat is in its infancy. This review covers the work available to date on the use of machine learning in cultured meat and explores future possibilities. We address four major areas of cultured meat research and development: establishing cell lines, cell culture media design, microscopy and image analysis, and bioprocessing and food processing optimization. In addition, we have included a survey of datasets relevant to CM research. This review aims to provide the foundation necessary for both cultured meat and machine learning scientists to identify research opportunities at the intersection between cultured meat and machine learning.

1 Introduction

Food production generates about a quarter of global greenhouse gas emissions and causes other negative impacts on the environment and human health (Crippa et al., 2021; Steinfeld et al., 2006). Animal products, including meat, seafood, eggs, and dairy contribute more than 56% of food’s emissions, despite providing only 37% of protein and 18% of calorie intake (Poore and Nemecek, 2018). Meat production is a large sector, producing 328 Mt. in 2020 and expected to expand to 374 Mt. by 2023, based on estimates from the Organisation for Economic Co-operation and Development (OECD) and the Food and Agriculture Organization (FAO) of the United Nations (Michele, 2021). In light of projected growth in the global population and income, these estimates projected that meat consumption will increase by 14%. To meet global meat demand and limit global warming to 1.5°C there is a need for major changes in the production of meat (Clark et al., 2020; Ivanovich et al., 2023). Cultured meat (CM), known by many names including “cell-based” or “cultivated” meat, is an emerging technology that uses tissue engineering and biomanufacturing techniques to produce animal meat through cell culture rather than animal husbandry. Proponents of the technology herald its potential to provide an option for producing animal agriculture products with reduced environmental, ethical, and health impacts (Sinke et al., 2023; Tuomisto and Teixeira de Mattos, 2011). However, major technological challenges remain in bringing CM products to market and achieving their proposed benefits (Risner et al., 2021). Many challenges stem from the fact that the technologies for mammalian tissue culture come primarily from the medical field, where the scale is much lower and the market has weaker incentives to reduce the costs of production. However, if these technologies are to be applied to food, the challenges of scale and cost must be addressed.

Some specific improvements to CM can be, and have been, made using traditional experimental approaches. However, tackling some of the more complex research questions requires more advanced approaches and experimental conditions. In recent years, an increasing number of groups have been using methods enhanced by artificial intelligence (AI), and in particular its subset machine learning (ML), for such tasks. ML can streamline experiments, predict optimal results, and reduce experimentation time and resources. There are many opportunities for ML to accelerate research development and reduce costs in CM. Some companies have indicated the use of ML in CM product development or CM-associated services (Business Wire, 2023; Ho, 2021; Leach, 2024; Marston, 2022; Penarredonda, 2017; Protein Report. Protein Report, 2022; Shieber and This, 2021; Southey, 2023), but very little of this progress in applying ML to CM has been shown or validated in the public domain. Academic and government-supported research in this space is emerging, including at the University of California at Davis, Virginia Tech, Tufts University, and The CentRe of Innovation for Sustainable banking and Production of cultivated Meats (CRISP Meats). However, few publications on the topic exist to date, which are summarized in Table 1 (Cosenza et al., 2023; Cosenza et al., 2022; Cosenza et al., 2021; Nikkhah et al., 2023; Ng and Tan, 2024). An increase in open public research on the use of ML to optimize and scale CM production would greatly accelerate the application of ML to the CM field.

Table 1

Table 1. Summary of published works on the application of AI/ML to CM.

In this review, we aim to provide the foundation necessary for researchers, from the CM or ML fields, to identify research opportunities at the intersection between CM and ML. We first provide a brief overview of both fields. Subsequent sections delve into both existing and potential ML applications for optimizing cell lines, formulating culture media, aiding cell culture microscopy and image analysis, and optimizing bioreactor and food processing parameters. Note that we have focused on CM challenges to which ML can be applied, and this should not be seen as a comprehensive review of all challenges in the CM or ML fields. As applications of ML in CM are limited, we discuss how ML methods that have been applied in other areas of bioinformatics can be adopted to solve tasks in CM. Finally, by combining existing literature and atlases, a compilation of animal biology datasets has been created for different CM-relevant species (Supplementary Table S1).

2 Background on the fields of cultured meat and machine learning

2.1 Cultured meat

CM aims to replicate the taste and texture of animal tissue within a manufacturing system using animal cells (Figure 1). First, cells from a species of interest are selected or engineered for desirable growth and differentiation traits in vitro. The selected cells are grown in a suitable medium, providing the nutrients and signaling cues that would normally be provided in the body. At early stages, such as cell selection, the cell culture may start at a small scale in plastic dishes or flasks. Cell culture is eventually scaled up to industrial-scale bioreactors, devices capable of controlling environmental temperature, pH, dissolved oxygen, and nutrient exchange at large volumes (Post et al., 2020). Once enough cells are grown, they are differentiated into mature cell types. At this stage, the cell culture may be formed into a tissue (i.e., structured product, such as a steak) or a cell slurry, which is later processed into a meat product (i.e., unstructured product, such as ground beef).

Figure 1

Figure 1. Cultured meat manufacturing process. Reproduced from Reiss et al. (2021) with permission.

This process leans heavily on medical tissue engineering, an area of research that has been studied for nearly 40 years and has been commercialized at a small scale for simple grafts of cell-laden scaffolds for skin and cartilage. However, complete tissues, such as skin with hair follicles and sebaceous glands or functional muscle, still face technological barriers (Beheshtizadeh et al., 2022). While tissue function is not critical for CM, using this technology for the production of food comes with unique constraints, especially the need for larger scale, lower costs, and materials that are both edible and palatable.

How closely CM products replicate the properties of conventional meat varies depending on the techniques used and will rely on further technological development to reach the goal of equivalent taste, texture, nutrition, and cost. A comparison of CM and conventional meat has been reviewed elsewhere (Fraeye et al., 2020; Broucke et al., 2023; Chriki et al., 2024).

Commercial and academic interest in CM has grown rapidly in the last decade. The number of companies working on CM grew from 1 to over 170 from 2011 to 2023, with over 3 billion dollars of investment (Battle et al., 2024). Similarly, academic interest in the field has grown dramatically, with 350+ papers published on CM in the last 2 years, more than all other years prior (with tracking) combined (Battle et al., 2024). In parallel, cost estimates for CM have come down significantly in the last decade: from the first CM demonstration in 2013 of a 140-gram burger at approximately €250,000 (Kupferschmidt, 2013), to recent claims of costs as low as $7.70/lb from industry developers (Poinski, 2021). However, these industry cost claims have yet to be proven publicly. Furthermore, CM production has yet to be shown on a scale close to that needed to offset even a fraction of current meat consumption (FAO, 2020). Current production capacity is not known, but estimates range from 1 to 10 kg/y, compared to the 3.2 × 10¹¹ kg/y produced by conventional meat (Humbird, 2020). Highly optimized industrial mammalian cell lines, such as Chinese hamster ovary (CHO), are still produced at a much lower scale and higher cost than needed for food production (Humbird, 2020). Given that consumers are unlikely to want to consume hamster ovary cells, significant technological challenges must be overcome for meat and seafood-relevant cells to be produced at a scale and cost needed to replace a meaningful portion of the conventional meat market.

2.2 Machine learning

An ML workflow involves a series of steps, as illustrated in Figure 2, which starts with preparation of the dataset. A dataset typically consists of datapoints, each one an observation with features that describe the datapoint. The type of data varies widely: numerical, time series, text, images, audio, video, sequential, graph, or any combination of these. This data undergoes preprocessing, where procedures such as imputing missing values and reducing data dimensionality are undertaken. Since many machine learning models require numerical data, a data transformation step converts the data to numerical values (Kabas et al., 2023; Kayakuş and Açıkgöz, 2022). Subsequently, an integral component of the ML methodology is dividing data into distinct subsets: training, validation, and testing. Some common approaches, such as K-fold cross-validation, leave-one-out, and holdout validation, can typically be used for most problems (Bishop, 2006). Prior to model training, it may be necessary to employ feature selection or extraction techniques on the training set to identify the most informative variables in the data, and these features must be used in both the test and validation sets to maintain the integrity of the evaluation process.

Figure 2

Figure 2. Illustration of the typical steps of how machine learning can be applied to biological data.

The culmination of this preparatory work is a dataset suitable for training an ML model. Typically, experiments are conducted with different ML algorithms/architectures (such as random forests, k-means, and deep neural networks) to settle on the most performant overall model. However, sometimes the choice of ML model depends on the type of data. For example, convolutional neural networks (CNNs) are preferred for image data since they can extract spatial features from the image (Li et al., 2022). On the other hand, recurrent neural networks (RNNs) are typically used for sequential data since they can remember information in a long sequence through their gate mechanism (Lipton et al., 2015).

The training set is instrumental in building the model since the model adjusts its parameters to learn the distribution of the training set. The validation set is required for fine-tuning the model’s hyperparameters and ensuring that the model generalizes well beyond the training data, thereby avoiding overfitting. Finally, the test set provides a measure of the model’s predictive accuracy and overall performance in real-world scenarios. Some of the typical performance metrics for classification tasks are accuracy, precision, recall, and F1-Score. Typical performance metrics for regression are R² score, mean absolute error, and mean square error (Alpaydin, 2020). However, lots of other performance metrics are used depending on the tasks and types of data. The purpose of performance metrics is to compare the performance between different models and also to understand how a model is performing overall for the specific task. In Figure 2, to make the machine learning procedure easily understandable, we provided the example of a simple machine learning process. Note, this example does not include procedures that involve reinforcement learning (discussed in Types of Machine Learning subsection) or other complex supervised and unsupervised learning scenarios.

2.3 Types of machine learning

ML methodologies can be broadly classified into three categories: supervised, unsupervised, and reinforcement learning (RL). In supervised learning, the model is trained on a labeled dataset, where each example is paired with an outcome or label that aligns with the objective of the specific task at hand. Consider the goal of predicting gene expression levels from DNA sequences; here, the data point would be the DNA sequence, while the associated expression level serves as the label. The model hones its predictive capability by learning the relationship between input features and their corresponding labels. When it comes to evaluation, the trained model is tasked with predicting labels for new, unseen data points. Key traditional supervised learning models include logistic/linear regression, k-nearest neighbors, and support vector machines. The supervised ML approach has made significant contributions across various domains in bioinformatics, enabling advancements in DNA segmentation, gene expression prediction, and protein structure prediction (Larrañaga et al., 2006).

Unsupervised learning, in contrast, does not rely on labeled data. It is particularly useful when the goal is to unearth underlying patterns or structures within the data, independent of predefined outcomes. This makes unsupervised learning a potent tool for exploratory analysis, especially in scenarios where labeled data is scarce or when the structure of the data is not fully understood. Some of the key unsupervised learning approaches are k-means, hierarchical clustering, and density-based spatial clustering of applications with noise. Unsupervised learning has been successfully applied in grouping functionality-related genes, microarray analysis, and biological image segmentation (Larrañaga et al., 2006; Parasa et al., 2021).

RL is a dynamic and adaptive approach well-suited for situations where a machine is required to make a series of decisions to achieve a desired goal or perform an optimization task, offering a framework for learning through interaction (Sutton and Barto, 2018). Within this paradigm, an agent—often an advanced ML model in deep RL—engages in a sequential decision-making process, each time interacting with a complex environment represented by a set of variables that define the current state of the environment. The agent executes actions, transitioning between states, and ultimately may reach a terminal state, signaling the conclusion of the decision sequence. The RL framework incorporates a system of rewards and penalties, with the agent receiving feedback in the form of rewards for beneficial actions or penalties for undesirable outcomes. The objective for the agent is to devise a strategy that maximizes cumulative rewards, thus steering toward the most optimal actions to attain its final goal. RL has recently been applied to bioinformatics (Angermueller et al., 2020; Jumper et al., 2021; Neftci and Averbeck, 2019) and gained lots of attention because of its success in areas such as sequence alignment (Jafari et al., 2019) and protein loop sampling (Barozet et al., 2020).

2.4 Neural networks

Recent advances in applying ML to biology are based on neural networks, a subset of ML methods that can be employed in all three ML categories: supervised, unsupervised, and RL (Menden et al., 2020). ML models with neural networks are often referred to as deep neural networks when there are multiple layers of neural networks in the architecture of the model, a method more broadly known as deep learning. These neural network layers typically attempt to mimic the activity of brain neurons, where each neuron employs a mathematical function that alters the data it receives from the previous layer (LeCun et al., 2015; Schmidhuber, 2015). At first, the data is fed into an input layer, which then connects to hidden layer(s) (used for computing), and finally an output layer, designed to deliver the final prediction. The model learns by optimizing the function parameters of each node. Some popular neural network architectures are feed forward networks, CNNs, RNNs, and transformers (LeCun et al., 2015; Schmidhuber, 2015; Vaswani et al., 2017). Deep learning models typically capture complex biological processes and incorporate heterogeneous data in the model through its different layers, which may be a necessity for many optimization and prediction tasks in CM (Li et al., 2020). It can be applied to both supervised and unsupervised scenarios. In RL, deep learning models can be employed as agents as well (Li, 2018).

2.5 Generative AI

Generative AI is another subfield of unsupervised learning which typically aims to generate new data or samples based on the patterns learned from the training data. Variational Autoencoders (VAEs) (Kingma and Welling, 2019; Singh and Ogunfunmi, 2022; Wei and Mahmood, 2021), which employ deep learning, are the first generation of generative AI models. VAEs typically employ two deep learning models: (i) an encoder that encodes the input into a latent space and (ii) a decoder that reconstructs the input by sampling from this latent space using variational inference techniques. Together, the encoder and decoder minimize the reconstruction loss of the input. Figure 3 shows the general architecture of VAE.

Figure 3

Figure 3. General architecture of a variational autoencoder (VAE).

The second generation of generative models is deep adversarial networks (Creswell et al., 2018; Zhang et al., 2017). These architectures also have two networks: a generator and a discriminator. Initially, generators generate new data randomly and discriminators classify whether the generated sample is original or generated. Iteratively, both the generator and discriminator learn how to generate a better sample and distinguish between original or generated sample. Eventually, the generator learns to generate realistic data samples that are difficult for the discriminator to distinguish from real data.

Recent advances in generative AI are mostly based on transformers (Vaswani et al., 2017), which have revolutionized various fields including natural language processing, computer vision, and bioinformatics (Ji et al., 2021; Zhang et al., 2024). Transformers have demonstrated exceptional capabilities in capturing long-range dependencies and modeling complex sequential data. In bioinformatics, methodologies heavily draw inspiration from NLP techniques due to the inherent similarities between biological sequences and natural language texts. By employing transformers, researchers are able to effectively model biological sequences such as DNA, RNA, and protein sequences, leading to significant advancements in tasks such as sequence generation, structure prediction, and drug discovery (Jumper et al., 2021; Ji et al., 2021; Zhang et al., 2024; Bran and Schwaller, 2023; Nguyen et al., 2024).

2.6 Graph neural networks

Graph neural networks (GNNs) are another area of ML that works with graph structured data and biological networks, such as via protein interaction networks, gene coexpression networks, and metabolic networks (Scarselli et al., 2009; Zhou et al., 2020). A biological network or biological graph typically is arranged in terms of nodes and edges, where the nodes are biological entities (i.e., genes, proteins, metabolites, etc.) and the edges indicate how these entities relate to one another. GNNs can be used to model these complex networks and predict how perturbations will affect the whole network.

GNNs are based on the principle of message passing, where each node in the graph aggregates all embeddings (messages) of its neighbor nodes and updates the weights of pooled messages through a neural network. There are three main tasks in graph structured data. The first one is link prediction between two biological entities. Examples of link prediction are predicting the interactions between proteins or predicting interactions between genes/regulatory elements (Kumar et al., 2020; Li et al., 2022). The second task is node functionality prediction. For example, predicting an unknown function of a protein based on the physical interactions between proteins in the protein interaction network (Muzio et al., 2021). The third application of ML in network analysis is to classify sub-network functionality. Figure 4 shows different ML tasks in a graph or network structured data. Muzio et al. classified functions of subnetwork based on a molecule’s toxicity or solubility (Muzio et al., 2021). In addition, ML is also applied to obtain subnetwork embedding where the subnetwork is represented as a vector preserving the important information within the subnetwork in a numeric form. This embedding can facilitate further analysis of the subnetwork and is used for various downstream tasks (Nelson et al., 2019).

Figure 4

Figure 4. Typical machine learning tasks for network analysis.

3 Cell lines

Meat consists of various cell types, predominantly muscle cells (approximately 90%), with fat and connective tissue cells accounting for the remaining 10% (Listrat et al., 2016). Additionally, there are some vascular, neural, and tissue-resident immune cells present in small amounts (Listrat et al., 2016; Ben-Arye and Levenberg, 2019; Reiss et al., 2021). To provide the taste, texture, and aroma expected in meat, most CM development has focused on muscle, fat, and connective tissue production. Cells will also likely contribute to the nutritional properties of CM products, and cell optimization may be used to tailor the nutritional profile of CM (Smith-Uchotsk and Wanjiru, 2023; Stout et al., 2020).

The cell types used to produce CM range from lineage-committed progenitor cells (e.g., muscle satellite cells, myoblasts, or preadipocytes) to stem cells able to differentiate into a broader set of cells (e.g., mesenchymal stem cells, embryonic stem cells, or induced pluripotent stem cells) (Ben-Arye and Levenberg, 2019). Developers may use primary cells, meaning cells that are isolated directly from an animal. However, cell lines, which are established cultures of cells that have been selected and optimized, are ideal because they are more consistent, characterized, and reduce the use of animals in the supply chain. Furthermore, immortalized cell lines or pluripotent stem cells are of particular interest due to their ability to escape the typical limits on population doublings seen in most primary cells (i.e., Hayflick limit) (Cong et al., 2002; Hayflick, 1965). A detailed review of cells used in the production of CM can be found elsewhere (Reiss et al., 2021).

Few CM-relevant cell lines are currently well characterized and commercially available, with most coming from model organisms used in biomedical research, such as mouse, rat, or zebrafish, and cell lines for many agricultural species still need to be developed (Soice and Johnston, 2021). Cell lines for marine species, especially invertebrates such as mollusks or crustaceans, are especially underdeveloped, with only a few reported cell lines and not all are food-relevant species (Béjar et al., 2002; Buonocore et al., 2006; Gignac et al., 2014; Goswami et al., 2023; Krishnan et al., 2023; Li et al., 2021; Parameswaran et al., 2007; Parton et al., 2007; Potter et al., 2020; Saad et al., 2023). For many species, there is a lack of basic knowledge of their physiology and the biochemistry required for in vitro culture or immortalization (Soice and Johnston, 2021; Rubio et al., 2019; Musgrove et al., 2024). The lack of knowledge of molecular and genetic markers as well as few species-specific antibodies available makes identifying and curating cell lines difficult for under-studied species (Musgrove et al., 2024; Ravikumar et al., 2024). ML can help biologists analyze complex cellular data to assist with identifying ideal cell line populations and optimizing existing cell lines through gene perturbations.

3.1 Network analysis is a tool to model biological interactions

Optimization of cell lines is often challenging because it requires understanding the “state” of a cell or cell population (i.e., what the gene and protein networks are doing) and selecting or engineering for desired cell states. Measuring and predicting these states involves interpreting complex interactions between genes and proteins, identifying those that are important for specific qualities, and predicting how perturbations will affect the whole network. ML can be used to model these interactions through network analysis. Network analysis is used throughout sections 3 and 4, so a general overview is first presented here. In recent years, GNNs have been employed for biological network analysis. GNNs have been implemented to predict protein interactions (Jha et al., 2022; Yang et al., 2020), molecular interactions (Huang et al., 2020; Kang et al., 2022), metabolite-disease associations (Sun et al., 2022) and obtain subnetwork embeddings (Ciortan and Defrance, 2021) that can be used to identify the functions of a biological subnetwork. Generative deep learning strategies, such as VAEs and deep adversarial network-based models, are also trained on biological networks by employing GNNs (Sun et al., 2022; Zhang et al., 2020). The most famous example to date is AlphaFold2, a deep learning GNN model that was employed on amino acid sequences to predict more than 200 million protein structures (Jumper et al., 2021). This is a significant advance in the field of structural biology which can be used in protein design, drug target prediction, cell type identification, and antibody development. AlphaFold2 can play a big role in designing new proteins and understanding the physical relationships between proteins (Evans et al., 2024).

Network analysis can employ multiple types of biological data within a multi-omics setting. Since each omics technique typically captures a specific biological process, integrating multi-omics data can provide a holistic overview of the biological process. In a multi-omics ML approach, different ML models are employed on different types of data (such as genomics, transcriptomics, proteomics, and metabolomics) to obtain numerical representations. These numerical representations are then combined together to obtain a more informative representation used to predict final outputs. For example, MOGONET employs a GNN on mRNA expression, DNA methylation, and miRNA expression data to predict disease information (Wang et al., 2021). Similar datasets can also be used for CM to predict cell line features. More applications of multi-omics ML methods are discussed in section 3.2.

Furthermore, network analysis can help determine aroma and flavor (Lee et al., 2023). Since meat aroma and flavor are largely controlled by metabolic pathways (Ramalingam et al., 2019), network analysis can be applied to biosynthetic pathways to enhance or add flavors to CM products. Network optimization has been used to design yeast that overexpress licorice glycoside (Huang et al., 2021), and although licorice is a flavor that few would want in a steak, one could imagine more savory corollaries. In other yeast experiments, transcriptomic network analysis has been used to improve acid resistance, and acid resistance is just as important to mammalian culture as to microbial culture (Li et al., 2021).

3.2 Machine learning can help to analyze complex omics data to identify and characterize new cell lines

“Omics” approaches, such as genomics, transcriptomics, proteomics, and metabolomics, are powerful tools for identifying and characterizing cell lines. In particular, RNA sequencing (RNA-seq) technologies are commonly used to quantify cellular gene expression to validate and optimize cell lines. However, analyzing RNA-seq or other-omics data across many candidate cells is a complex and daunting analytical task. ML can help these analyses in multiple ways, including by grouping functionality-related cells using an unsupervised approach (Ciortan and Defrance, 2021), profiling gene expression using a supervised approach (Chen et al., 2016), and identifying different tissue types using unsupervised (Li et al., 2020; Lin et al., 2020) and semi-supervised approaches (Alvarez et al., 2020).

Using gene expression data to cluster cells based on cell type or behavior can help explain heterogeneity among cell populations and discover subpopulations with beneficial characteristics. When establishing cells for CM production, scientists may want to isolate only certain cell types with optimal attributes or remove undesirable cell types. For example, using single cell RNA-seq (scRNA-seq), Messemer et al. found that an isolation from cattle muscle contained 11 distinct cell types (Messmer et al., 2023). This work led to a better understanding of the cells derived from a primary isolation, as well as cell surface markers suitable for the identification and separation of populations by flow cytometry. Additionally, in a recent preprint, Melzener et al. used RNA-seq to study subpopulations during muscle differentiation, understanding cell fates with an aim to improve the efficiency of cell differentiation and maturation (Melzener et al., 2024). ML is also increasingly being explored for the modeling of cell trajectories via scRNA-seq (Qiu et al., 2022).

Unsupervised ML can help to map cellular heterogeneity by grouping functionality-related cells, identifying cell sub-populations, and performing dimensionality reduction (Li et al., 2020; Lin et al., 2020; Brendel et al., 2022). Typically, the input to the unsupervised ML model is gene expression data obtained from RNA-seq. In conventional ML frameworks, which are not based on neural networks, the outcome is generally an assigned cluster number for each cell or gene. For an in-depth exploration of how traditional ML techniques are applied in this context, we direct readers to the comprehensive review by Petegrosso et al. (2019). In contrast, unsupervised deep learning methods predominantly leverage autoencoders, which compress high-dimensional cellular data into a more manageable lower-dimensional space while retaining essential information (Li et al., 2020; Lin et al., 2020; Brendel et al., 2022). This lower dimensional representation of the encoder can be used to obtain clusters of cells (Li et al., 2020; Lin et al., 2020; Svensson et al., 2020; Tian et al., 2019; Wang and Gu, 2018) and can be further fine-tuned for gene expression profiling (Menden et al., 2020; Alharbi and Vakanski, 2023).

Recently, another family of autoencoders that uses GNNs has been employed to obtain the lower dimensional representation of transcriptomics data. These GNN-based autoencoders use the knowledge of biological networks, such as gene–gene relationships, cell interaction networks, protein interaction networks and biological pathways, along with gene expression data to obtain a more robust and informative representation of cells and genes (Ciortan and Defrance, 2021; Brendel et al., 2022; Gan et al., 2022; Rao et al., 2021; Shan et al., 2023; Wang et al., 2021; Wen et al., 2022). The GNN autoencoders are unique compared to traditional autoencoders as they encode information on biological interaction between entities along with structural information and biological properties.

Semi-supervised approaches have been employed in many tasks, such as learning responses due to gene perturbation, different functional score prediction due to gene knockouts, identifying different tissue types, and transcriptome analysis (Alvarez et al., 2020; Aromolaran et al., 2020; Chuai et al., 2018; He et al., 2020; Lotfollahi et al., 2021; Osorio et al., 2022; Tian et al., 2018). These models are adept at leveraging both labeled and unlabeled data, typically employing the unlabeled data to train an autoencoder and then using the labeled data to fine-tune the autoencoder toward the target outcomes (Alvarez et al., 2020; He et al., 2020; Bernstein et al., 2020). Fine-tuning is the process of adjusting pretrained model parameters for a specific task by utilizing a small labeled dataset. The semi-supervised approach is notably different from its unsupervised counterpart as it provides a degree of guided learning, which is crucial when the available labeled data is sparse but critical for the identification of specific labels. For instance, while the categorization of cells based on functional similarities may fall under unsupervised learning—grouping cells by inherent characteristics—the task of pinpointing cell doublets could benefit from a semi-supervised model that utilizes a limited set of known doublet samples to enhance its predictive accuracy.

In addition to transcriptomic data, further single-cell multimodal sequencing technologies have been developed that provide cell-specific information, such as chromatin accessibility (scATAC-seq) (Baek and Lee, 2020) and surface proteins (CITE-seq) (Stoeckius et al., 2017). These data types provide complimentary insight into scRNA-seq, such as improving accuracy in modeling gene regulatory networks in the case of scATAC-seq (Huang et al., 2023; Kim et al., 2023). This understanding can aid in CM-related tasks, such as identifying transcriptomic/epigenetic markers predictive of high proliferation and differentiation potential given previously observed variability in primary cell culture performance (Melzener et al., 2022; Meßmer, 2023; Metzger et al., 2020).

In an ML context, autoencoders and GNN-based deep learning models are mostly applied to this multimodal data (Athaya et al., 2023). Some of these autoencoders employ individual encoders and decoders for each modality. We refer the readers to a review of multimodal single-cell models, and a review on general best practices for single cell analysis, to learn more about the topic (Athaya et al., 2023; Heumos et al., 2023).

3.3 Antibody design for characterization and isolation of novel species can be aided by machine learning

In cell line development it is crucial to both characterize cells, commonly via omics data (as discussed above) or visual identification (discussed in section 5), and isolate the cells of interest, typically using flow cytometry cell sorting. Visual identification and flow cytometry both require the use of antibodies with specific binding affinity to known cellular markers. Markers of muscle cell differentiation are well understood for most mammalian species (Yu et al., 2021). However, in under-explored species, such as fish or aquatic invertebrates, these proteins are not always shared with mammalian cells or, to the degree they are, have low sequence conservation (Musgrove et al., 2024; Liongue and Ward, 2007). For example, recent work with an Atlantic mackerel (Scomber scombrus) skeletal muscle cell line found that antibodies for the muscle satellite cell marker paired-box protein 7 (PAX7) was successful, while early myogenic marker myoblast determination protein 1 (MYOD) was not (Saad et al., 2023). Surface markers, which are most commonly used during flow cytometry cell sorting, are particularly understudied in CM-relevant cell types, with most existing research focused on immunology (Liongue and Ward, 2007; Charoensawan et al., 2010).

Attempts at validating commonly used mammalian antibodies in fish have led to issues with cross-reactivity and specificity, suggesting that establishing cell lines for CM research will require the repurposing of existing antibodies, if not the development of fully novel antibodies (Antuofermo et al., 2023). The use of ML on omics data has been shown to aid antibody development, such as via improving species cross-reactivity, antibody co-optimization, and binding affinity (Bennett et al., 2024; Hie et al., 2024; Makowski et al., 2022). The autoencoder model totalVI, trained on joint scRNA-seq and CITE-seq data, is able to provide insight into many key variables that would aid antibody development: the identification of novel differentially expressed features to target in a given cell subpopulation, the improved prediction of false positive/negative surface proteins, the reduction of technical bias common in antibody-based measurements such as “background,” and the improvement of experimental design by helping determine optimal antibody titrations/sequencing depths for balancing cost and signal-to-noise ratio (Gayoso et al., 2021). For a full review of computational methods relevant to antibody development, readers are pointed to Kim et al. (2023).

3.4 Machine learning can map and enhance genetic traits to optimize cell lines

Gene editing can be used to enhance or alter cellular traits to generate cell lines optimized for CM production, such as accelerating growth, extending growth (i.e., immortalization), reducing input costs, or tailoring the flavor and nutrition. For example, manipulating cellular metabolism has a large potential for increasing the efficiency (and therefore reducing costs) of CM products (Risner et al., 2021; Humbird, 2021). As another example, Stout et al. engineered cells to overexpress FGF2, eliminating the need for FGF2 supplementation in media through autocrine signaling (Stout et al., 2024). Because many of the species used in CM are under-studied, their genome regulatory networks are not yet well understood, slowing efforts for gene editing. ML offers an opportunity to accelerate gene editing technology.

Genes typically have many regulatory regions, such as upstream and downstream regions, untranslated regions, promoters, and enhancers. These regulatory regions determine where, when, and how much a gene is expressed. Thus, to apply gene editing technology, identifying these regulatory regions is an essential task where ML models have been applied successfully (Ji et al., 2021; Danilevicz et al., 2022; Levy et al., 2022). The most relevant modern ML techniques demonstrated to be useful for these tasks are generative adversarial networks (Wang et al., 2020; Zrimec et al., 2022) and convolutional neural networks (Kotopka and Smolke, 2020). ATAC-sequencing is a very useful data modality for informing on this set of tasks (Yan et al., 2020).

More recently, researchers have adopted transformers and architectures similar to large language models, such as BERT, to segment regulatory regions and predict expression levels (Ji et al., 2021). These models are typically trained on a large number of DNA sequences or similar kinds of data using a supervised fashion by employing a masking strategy to generate a labeled dataset when labels are not available (Ji et al., 2021; Danilevicz et al., 2022). This entails subdividing a DNA sequence into numerous smaller fragments, typically employing a k-mers based approach, where a k-mer is a subsequence of length k within the DNA. Selected k-mers are then masked within the sequence. The objective of the transformer-based model is to predict those masked parts of the sequence while imitating the interactions among different fragments (Ji et al., 2021; Akiyama and Sakakibara, 2022). By predicting the masked fragment, the model learns the underlying structure of the DNA sequences. The resulting output is a set of numerical vectors, often referred to as embeddings, representing each k-mer, encapsulating the sequence information. These vector representations can be further used to fine-tune the model for various downstream tasks, such as segment identification, sequence alignment, and gene expression prediction where limited labeled data is available (Ji et al., 2021; Danilevicz et al., 2022; Levy et al., 2022).

When labeled data is available, convolutional neural networks and transformers can be combined to predict outcomes. This approach employs convolutional neural networks on the small DNA fragments to obtain an initial representation. Following this, the processed fragments are inputted into transformer layers. The transformer architecture finds how each fragment interacts with other fragments and summarizes these interactions in a vector that represents the fragment. The vector representations of all fragments are finally used to predict different genomic tracks (Avsec et al., 2021). Since the vector representations of the DNA fragments derived from the final transformer layer encapsulate information that is both comprehensive and adaptable across various genomic tracks, these vectors can be effectively utilized for predicting a range of genomic tasks beyond those initially targeted.

Since these approaches can be used for gene expression prediction, they are particularly useful for predicting the expression of an edited DNA sequence. Gene editing involves adding, removing, or substituting a segment of DNA to alter the expression of a specific gene. Identification of replacement DNA segments is a complex task that can be addressed by employing an ML algorithm that can estimate the gene expression. Alternatively, using RL, the discussed ML approaches can be trained to generate gene-edited DNA sequences that maximize the expression by following the same training architecture of Large Language Models, such as ChatGPT (Ouyang et al., 2022). Initially, one of these models can be fine-tuned to generate a gene-edited DNA sequence from an input DNA sequence (supervised fine-tuned model). Then, the supervised fine-tuned model can be optimized for predicting gene expression, which will work as the reward function of the RL agent (reward model). Finally, the supervised fine-tuned model can be further optimized by employing RL optimization techniques, such as Proximal Policy Optimization (Schulman et al., 2017), where outputs of the reward model are used as rewards for generated sequences.

4 Media design

Cell culture requires media to keep cells alive, promote growth, and direct differentiation. Culture media contains the nutrients and signals that cells receive in the body and is typically composed of carbohydrates, amino acids, vitamins, minerals, buffers, proteins, peptides, fatty acids, lipids, and growth factors. In addition, media components may also contribute to the flavor of the final product (O’Neill et al., 2021; Simsa et al., 2019). A detailed discussion of how cells utilize these components has been covered previously (O’Neill et al., 2021).

Most existing cell culture media formulations are expensive, not designed for the species of interest for CM production, and include animal-derived ingredients (O’Neill et al., 2021). Media design for CM requires optimizations to address these limitations, which are often complex and resource-intensive research efforts. ML is well suited to accelerate research in this space.

4.1 Culture media design can be treated as a hyperparameter optimization problem

The objective of media design is to identify conditions that are optimized for various parameters, such as price, growth rate, stability, flavor, environmental impacts, and lot-to-lot reproducibility, sometimes optimizing for several of these objectives simultaneously (O’Neill et al., 2021; Hubalek et al., 2022). In addition, media may be optimized for specific cell responses, such as differentiation or spontaneous immortalization (Messmer et al., 2022; Pasitka et al., 2023). In certain cases, media may need to be developed for uncharted parameter spaces; for instance, cells derived from lamb, duck, and deer that were not previously grown for biomedical purposes.

Most culture media contain dozens of ingredients— the classic media Ham’s F-12 contains 47 ingredients (Ham, 1965). Certain ingredients, such as fetal bovine serum or B27 (Brewer et al., 1993), are themselves complex mixtures of additional ingredients. In the absence of ML, single specific improvements of large effect can still be sometimes made to culture media, such as a group that swapped animal albumin for plant albumin (Stout et al., 2022). Big gains in the stability of stem cell media have been made with recombinant peptides (Kuo et al., 2020). Other groups have optimized media, such as bovine myoblast media, using straightforward factorial approaches, and design-of-experiments (DOE) for these approaches are quite mature (Franceschini and Macchietto, 2008; Kolkmann et al., 2022). However, a medium with 49 ingredients, each with five relevant concentrations, has a design space of over 10³⁴ potential recipes, exceeding the capacity of traditional factorial methods (i.e., Taguchi methods; Freddi and Salmon, 2019) or high-throughput screening.

Some attempts have been made to apply ML to the culture media design problem as a means of exploring the experimental search space more effectively. A recent preprint (Hashizume et al., 2022) used gradient-boosted trees to identify culture media that improve the growth characteristics of suspension-phase HeLa cells. Another recent paper trained a neural network to model the results of fractional factorial culture media experiments (Nikkhah et al., 2023). Response surface methodology, a classic DOE method, has been used in conjunction with a genetic algorithm for optimizing media properties such as cost, growth rate, and global warming potential for CM production in both zebrafish fibroblast (Nikkhah et al., 2023) and mouse myocyte (Cosenza et al., 2021) culture media. However, these methods, on their own, aren’t optimal for dealing with the complexity of full culture media recipe design. In addition, both studies use well-characterized cell lines, however their applicability for food production is limited, and further work is needed to translate these methods to more food-relevent species.

Designing culture media is fundamentally a high-dimensional search problem, and Bayesian optimization can effectively navigate problems of this nature, which has the advantages of requiring only a small number of experiments and dealing with uncertainty in a robust manner. Cosenza et al. has used Bayesian optimization to optimize myocyte culture media for CM applications, first demonstrating its increased efficiency over a standard DOE approach (Cosenza et al., 2022), followed by its use to optimize serum-free growth media for CM (Cosenza et al., 2023). However, these studies were performed using the C2C12 mouse myoblast cell line as a model system and further research is required to translate this to a species more relevant to CM production. Outside of CM research, Bayesian optimization has been used to optimize spirulina culture media (Gamble et al., 2021) and keratinocyte differentiation media (Kanda et al., 2022). Google Vizier, which is Google’s hyperparameter tuning service, is a key enabler of Bayesian optimization, as it provides biologists access to this algorithm via API, bypassing the need to code from scratch (Golovin et al., 2024). Additionally, the BoTorch library for Python offers a powerful free toolkit to implement Bayesian optimization (Balandat et al., 2020).

Even given an optimal algorithm for optimization, obtaining and analyzing relevant data from cultured cells remains a challenge in media design. One potential solution is through the use of high-content imaging, which can be enhanced through the implementation of ML methods, as discussed below in section 5. Another approach involves the application of gene network analysis, which may be the most effective method for gaining detailed insights into the cellular response to a given culture medium.

Bayesian optimization is the leading hyperparameter exploration technique, but it stands among others. Genetic algorithms, which cluster parameters in a way that mimics biological chromosomes to enable recombination (Katoch et al., 2021), have been used to optimize cyanobacteria media (Havel et al., 2006). Simulated annealing, which is based on a metallurgical concept (Delahaye et al., 2019), has been used in conventional animal agriculture to optimize poultry feed (Wijayaningrum et al., 2017) and to produce a population balance model for CHO cells (Wijayaningrum et al., 2017). However, Bayesian optimization appears to be the most widely used among these techniques, probably because it can work on non-differentiable objective functions without the difficulty of partitioning parameter space into somewhat arbitrary and finicky “chromosomes” (Qualities, 2020).

4.2 Machine learning can interpret how cells respond to media conditions

The heart of media design is attempting to alter the properties of cells by altering their environment. Assessing the properties of cells can be challenging—although RNA-seq can provide insight on the inner state of cells by quantifying expressed genes (Messmer et al., 2022). However, that information is often not readily interpretable because it is high-dimensional and the effect of any single gene is often esoteric or context-dependent. ML-guided network analysis is a solution - it can assess changes to the inner state of cells when performing media design or other optimizations, such as culture vessel geometry or atmosphere composition. Both gene-based network analysis and metabolite-based network analysis can be used for this application.

Gene network analysis has not yet been directly used in CM media research, but it has been used in parallel applications. Network analysis has been used to improve mammalian culture, CHO cells in particular, to identify feed supplements (Schinn et al., 2021). Network analysis has also been used to find gene networks important for feed efficiency or milk production in cattle, with the networks formed from phenotype–genotype analysis (Suchocki et al., 2016; Taussat et al., 2020).

To get more details about a gene network, gene network inference (GNI) methods can deduce the regulatory interactions between genes. The potential value of GNI to CM media design is that GNI can aid and simplify the interpretation of an RNA-seq data matrix; instead of looking at the differential expression of tens of thousands of genome features, the analyst can look at the differential activity of dozens of network modules. Among recent GNI methods are dynGENIE3 (random forests) (Huynh-Thu and Geurts, 2018), Scribe (Qiu et al., 2020), and TENET (Kim et al., 2021), all of which are benchmarks for the most recent papers (Liu et al., 2022). GNI has been used to optimize output in traditional agriculture, both soybean cultivation (Hale et al., 2022) and maize cultivation (Huang et al., 2018), and GNI could be used for analogous output maximization problems in CM.

In biotechnology, metabolic network analysis has, historically, most commonly been done with flux balance analysis to predict the inputs and outputs of metabolites in cells (Orth et al., 2010). On its own, flux balance analysis has been used for CM-relevant problems like modeling CHO metabolic states (Hagrot et al., 2017), evaluating metabolic hypotheses (Martínez-Monge et al., 2019), and developing genome-scale metabolic models (Széliová et al., 2020). However, flux balance analysis has also been combined with ML in hybrid systems for purposes such as correlating real-world data (Wu et al., 2024) or determining the most relevant model features (Vijayakumar et al., 2020). Given the use of flux balance analysis in CM and the demonstrated ability of ML to enhance it, this may be a useful application of ML to CM going forward.

4.3 Optimizing and engineering proteins is much easier with machine learning

Another opportunity for optimization is in the ingredients themselves. Growth factors are currently estimated to contribute over 95% of the total production cost (Specht, 2020). Altering proteins for use in cell culture media is a promising means to optimize the stability, price, or other desirable properties of these ingredients.

Recombinant production, a commonly used animal-free strategy to produce proteins, is a major contributor to media costs (Venkatesan et al., 2022). Most proteins present in culture medium are highly specialized eukaryotic proteins, requiring post-translational modifications such as phosphorylation or glycosylation that require mammalian cell culture systems, rather than the far cheaper and more scalable bacterial production systems. Recent work developed an E. coli expression system to reduce the price of several relevant growth factors (Venkatesan et al., 2022). Similar efforts are necessary to reduce the cost of other media ingredients.

Probably the biggest challenges when importing transgenic proteins are differential folding between species (i.e., different chaperones, different organism temperatures) and differential post-translational modification. There have been many efforts to use ML to predict post-translational glycosylation, with earlier efforts using random forests (Hamby and Hirst, 2008) and more recent efforts using multi-layer perceptrons (Pakhrin et al., 2021) or ensemble models (Ching-Hsuan et al., 2020). As for differential folding, transformer-based models have been used to predict thermal protein stability (Jung et al., 2023). Combining transformers and GNNs has been used to predict protein subcellular localization (Dubourg-Felonneau et al., 2022). More recently, AlphaFold has been used to predict protein stability in a general sense (Pak et al., 2023), and ColabFold was developed to improve upon it (Mirdita et al., 2022).

Protein ingredients can also be engineered to optimize their stability in culture, reducing the total amount of protein needed for CM production (Goldenzweig and Fleishman, 2018). For example, FGF2 thermal stability was increased using point mutations in the protein (Dvorak et al., 2018). Similarly, Long R³ IGF-1 is a modified recombinant form of IGF that prevents inactivation by IGF binding factors, leading to x200 potency and x3 stability compared to standard insulin (Voorhamme and Yandell, 2006). Similar strategies could be applied to other expensive media ingredients to reduce the necessary concentrations or additions. RFdiffusion (based on a generative diffusion model) (Watson et al., 2022) and ProteinMPNN (Dauparas et al., 2022) are tools designed to solve this problem. ML is also used in directed evolution experiments that iterate upon a given protein’s design (Hie et al., 2024; Saito et al., 2021).

For some proteins used in high concentrations, such as albumin, transferrin, and insulin, replacement with lower-cost alternatives, such as plant (Humbird, 2021; Okamoto et al., 2022) or microbial (Tuomisto and Teixeira de Mattos, 2011; Jeong et al., 2021) hydrolysates may be beneficial. For example, a protein similar to bovine insulin was found in cowpea (Vigna unguiculata) and could be isolated for use as an insulin replacement (Mj et al., 2015; Venâncio et al., 2003). Screening plants or microbes for sequence homology to proteins of interest could accelerate this work. The traditional, non-ML tool for homology searches is protein BLAST (Altschul et al., 1990), but BLAST is underpowered because proteins can have similar structures without having similar sequences. Services like Dali provide protein structure alignment but are constrained by the paucity of known protein structures and the computational complexity of three-dimensional comparisons (Holm and Laakso, 2016). ML has changed how to approach this problem with FoldSeek (van Kempen et al., 2023), which uses autoencoder-based compression to simplify three-dimensional protein structural comparisons, using AlphaFold to infer protein structures when no empirical data is available (Jumper et al., 2021).

5 Cell culture microscopy and image analysis

Microscopy is one of the foundational techniques of cell culture, providing information such as (i) the health of cells (e.g., whether they are mitotic, senescent, or apoptotic); (ii) the behavior of cells (e.g., whether they are invasive, contractile, or secretory); and (iii) the lineage of cells (e.g., whether they are stem cells, progenitor cells, or terminally differentiated cells). Many types of cell analysis rely on microscopy. For example, fusion index, the percentage of nuclei inside myotubes using immunostaining, is the predominant test used to quantify myogenic differentiation (Ben-Arye and Levenberg, 2019). Alternatively, inexpensive colorimetric staining has been shown as an alternative measure of myotube differentiation (Veliça and Bunce, 2011). Additionally, brightfield imaging, without the use of dyes, has been used to measure cell contractility, which is a measure of muscle cell maturity (Furuhashi et al., 2021; Ribeiro et al., 2017).

The relevance of microscopy during large-scale production will be especially dependent on its ability to serve as a low-cost, high-throughput tool. However, microscopy has been historically constrained by the complexity of its analysis. Microscopy analysis is routinely done manually by researchers with well-trained eyes, and doing it automatically requires systems that can incorporate many nuanced features of the image data. Unfortunately, appropriate systems tend to be nascent, poor, or non-existent in biological research. Furthermore, the use of dyes to improve image quality is undesirable due to the cost and time constraints of CM production. For CM production, cell segmentation and classification are fundamental and indispensable due to their multifaceted contributions - they are relevant to quality control, monitoring of cell culture health, and optimizing production. Utilizing ML approaches for automated cell segmentation and classification can lead to a reduction in the time, expenses, and errors involved in preparing the setup for manually analyzing the image data.

5.1 Automatic image analysis for cell segmentation using machine learning

Cell segmentation is the process of identifying and separating individual cells within an image. This is done in digital microscopy and histopathological imaging to study cell structure and function. The goal of cell segmentation is to pick out cells in an image, and segmentation is necessary to measure cell size, shape, and number. Segmentation also enables the tracking of individual cells over time, allowing researchers to assess changes in cell behavior and morphology, which can help in optimizing conditions for CM production. Furthermore, cell segmentation can also help in identifying contaminating debris or inappropriate cells from the culture, ensuring the quality and safety of the final product.

Recently ML has been used to improve cell segmentation for various applications (Al-Kofahi et al., 2018; Durkee et al., 2021; Kumar et al., 2020; Pachitariu and Stringer, 2022). Manual identification of individual cells has poor reliability between and within human evaluators. Additionally, automating cell imaging tasks can free up researchers’ time for higher value tasks and reduce errors from fatigue and subjectivity (Verma et al., 2021). However, there are several challenges that hinder the effectiveness of ML-based cell segmentation. First, cells are highly variable in size, shape, and morphology. Second, segmentation may fail on images that are low-contrast, have uneven illumination, or are out-of-focus. Third, crowded and overlapping cells can make it difficult to distinguish individual cells using cell-segmentation algorithms. Fourth, training data can overfit on imaging artifacts, such as non-uniform illumination and background noise (Zinchuk and Grossenbacher-Zinchuk, 2023). These challenges are not unique to CM but must be taken into account prior to applying ML for cell segmentation in any field.

In order to overcome the above challenges and to develop robust algorithms for detecting and segmenting cells, the computer vision community requires access to large, diverse, and well-curated datasets with comprehensive annotations. While some public datasets for nuclei and cell segmentation have been released in the past (Kumar et al., 2020; Verma et al., 2021; Caicedo et al., 2019; Greenwald et al., 2022; Kaimal et al., 2021; Naylor et al., 2019), additional datasets specific to CM are necessary to develop such robust algorithms for accurate cell segmentation in microscopic images.

A classic method for cell segmentation is the watershed algorithm combined with optimal thresholding, which has been used in applications such as segmenting lymphocytes into nuclear and cytoplasmic regions (Mohammed et al., 2013). There has been growing interest in deep neural net architectures inspired by fully convolutional networks for cell segmentation (Kumar et al., 2020; Naylor et al., 2019; Gómez-de-Mariscal et al., 2021; Long and Shelhamer, 2015; Wienert et al., 2012; Yu et al., 2016). These architectures employ encoder-decoder blocks to transfer features from multiple scales and levels for efficient cell segmentation on histopathology and microscopy images. The U-Net model, a variant of the fully convolutional network architecture, has shown particular promise for this task (Falk et al., 2019; Ronneberger et al., 2015). Unlike fully convolutional networks, U-Net incorporates skip connections that facilitate precise semantic segmentation by amalgamating features from diverse resolutions, enhancing the model’s capability to capture intricate details. Similarly, U-Net++ employs advanced encoder-decoder structures and loss functions to enhance performance for cell segmentation (Zhou et al., 2018). Such deep learning architectures have outperformed pathologists’ performance for cell segmentation in various applications (Franklin et al., 2021; Hekler et al., 2019).

5.2 Automatic image analysis for cell classification using machine learning

Cell classification is another important task in image analysis. While cell segmentation is the process of separating individual cells from the background and from each other in an image, cell classification is the process of assigning labels to the segmented cells based on their morphology, phenotype, or function (Dursun et al., 2023; Huynh et al., 2021). The primary goal of cell classification is to assign each segmented cell to a specific category or label, such as cell type, state, or condition.

Manual analysis of thousands of microscopy images for cell classification is a tedious and error-prone task. Thus, there is a need for ML algorithms for automatic cell classification. Recently, ML has been used for the classification of cells within microscopy images, including the identification of anemia or blood disorders based on the shape, size, and optical properties of red blood cells (Belashov et al., 2021), and for the analysis of cellular microenvironments to offer novel insights into biological mechanisms (Winfree, 2022). However, the effectiveness of ML-based cell classification for microscopy faces several challenges. One challenge arises from the crowded or overlapping cells, making it difficult to distinguish individual cells. Differences in cell maturity and variations in cell shape resulting from diverse cultivation and treatment methods can introduce complexity into the analysis, especially when distinguishing between different types of cultured cells. Complex textures, patterns, and shapes in multi-modal microscopy further complicate cell classification, requiring the integration of multi-stream models that consider low-level cues such as edges and gradients (Lou et al., 2023). The intricate cell structures found in tissue images introduce additional complexities such as heterogeneous cell populations and staining variations. These challenges necessitate the development of precise and efficient algorithms for cell classification.

In the domain of microscopy image analysis, many innovative architectures and methods have emerged for classifying cells. Among these, CNNs stand as a robust choice, with models like U-Net (Ronneberger et al., 2015) and Mask R-CNN (He et al., 2015) excelling in cell classification tasks. RNNs, particularly long short-term memory and gated recurrent units, demonstrate their prowess when handling sequential data, making them valuable for tracking cell dynamics (Ghojogh and Ghodsi, 2023). For tasks involving complex relationships between cells, GNNs, such as graph convolutional networks, prove invaluable (Chen et al., 2022). Traditional approaches like random forests and decision trees, which are based on hand-picked image features, are still relevant for classification because they are computationally fast, easy to train, and resist overfitting (Gurcan et al., 2009; Kumar et al., 2022). Transfer learning techniques harness pre-trained deep learning models like ResNet (He et al., 2015), while attention mechanisms and ensemble methods contribute to improved accuracy (Marzahl et al., 2019). With the ever-evolving landscape of microscopy, these versatile architectures continue to play pivotal roles in cell classification.

Following cell classification, cell phenotype analysis is pivotal in exploring cellular characteristics, encompassing physical and biochemical attributes such as size, shape, function, viability, proliferation, signaling, and morphological structure. This approach is particularly valuable in culture media optimization (Zhou et al., 2023). By scrutinizing the physical attributes and functional behavior of cultured cells, researchers can ensure the consistency and quality of cell populations, evaluate metabolic activity, and assess cellular functionality within the culture. Moreover, the examination of cell morphology offers insights into cellular health and enables fine-tuning of culture media formulations (Zhou et al., 2023; Grzesik and Warth, 2021). ML has been used to combine different forms of microscopy using transfer learning - such as between fluorescence and dye-free microscopy (Jang et al., 2021) or between traditional microscopy and mass spectrometry imaging (Race et al., 2021). Thus, in the burgeoning field of CM, phenotype analysis contributes to the refinement of CM production, guiding the selection of cell strains and culture conditions to achieve better quality and desired growth attributes. Overall, cell phenotype analysis plays a pivotal role in enhancing cell culture processes and product quality, making it a valuable tool for CM production.

6 Bioprocess and food processing optimization

Moving CM from lab bench scale to commercial scale requires efficient bioprocess design. This centers around the use of large bioreactors to produce a controlled environment for cell growth and differentiation that maximizes biomass and minimizes by-product yields. A variety of bioreactor types have been proposed for use in CM, which are reviewed elsewhere (Allan et al., 2019). One study estimated that producing 1 kg of protein from muscle cells would require stirred tank bioreactors on the order of 5,000 L (Stephens et al., 2018). This dwarfs research-scale mammalian cell culture and will require extensive optimization. In addition, after harvesting cells from the bioreactor, CM products will likely require food processing steps to create a final product. ML is well suited to increase the scale and efficiency of CM bioprocessing and food processing in a variety of ways.

6.1 Bioreactor homeostasis can be maintained with machine learning

As the culture scale is increased, automation of closed systems will become important to increase efficiency and reduce contamination or other failure events (Specht et al., 2018). Real-time quality assurance and process monitoring will allow for adjusting culture conditions to optimize yield, monitoring for potential contamination, and reducing human errors. Automation in processes such as media recycling will further optimize the process and bring costs down.

Historically, mammalian bioreactor operation has been governed by systems such as proportional-integral-derivative controllers (Synoground et al., 2021) or model predictive control (Sarna et al., 2023). These systems, although mainstays of control theory, are designed to work in deterministic environments that can be well-described by linear differential equations. Although these systems can struggle with the vagaries and unpredictability of biological systems, they have been successfully employed in bioreactors to maximize antibody production (Kiparissides et al., 2015), constrain overflow metabolism (Bogaerts et al., 2017), and maintain glucose homeostasis (Craven et al., 2014). This type of modeling has ceded ground to ML-based modeling in recent years, probably because, compared to ML, these models are relatively fragile because of their dependence on their top-down mathematical models accurately describing reality.

The most straightforward application of ML to bioreactors is to use ML-based models to control the bioreactor’s inputs. A wide gamut of ML models have been used to monitor and control bioreactors and industrial bioprocesses, especially supervised learning models like neural networks (Zavala-Ortiz et al., 2022), random forests (Vaitkus et al., 2020), and gradient boosting (Zhang et al., 2021). The step beyond using ML to optimize models is to use ML to optimize the policies themselves that determine the models, which is done by reinforcement learning, with examples in bioreactors ranging across deep Q-networks (Oh et al., 2022), policy gradients (Petsagkourakis et al., 2020), and probabilistic Bayesian optimization (Luna and Martínez, 2014). These policy-based learning methods do not necessarily require a prior understanding of the biochemistry of the bioreactor.

6.2 The unique challenges of structured products can be addressed by machine learning

As opposed to ground meat, structured tissues (i.e., a steak or fish filet) will require the formation of organized 3-dimensional tissues and bioreactor systems capable of supporting them. A structured product entails growing an organized and three-dimensional tissue, as opposed to growing cells in suspension - the difference between growing cells as a soup and growing cells as a steak. This type of engineered tissue has not been shown on any scale beyond a tissue for a single patient, making this a major whitespace in the field (Specht et al., 2018).

Imposing organization upon cells is a classic, challenging problem from the field of tissue engineering, and solutions often involve ML. Random forests (Conev et al., 2020) and neural networks (Bone et al., 2020) have been used to predict optimal parameters for the extrusion printing of hydrogel scaffolds. Gradient boosting has been used to predict the self-assembly of dipeptide-based hydrogel scaffolds (Li et al., 2019). In theory, ML could be used to design the structures of tissue scaffolds, but this avenue appears to be unexplored at this time. Further review of the application of ML for bioprinting has been completed previously (Ng and Tan, 2024).

Tissue self-organization is a powerful principle for structured meat production. Tissue scaffolds or extrusion printing are currently used to create top-down structured tissues because of a lack of understanding of bottom-up self-organization of tissue in vitro. However, breakthroughs in bottom-up tissue self-organization would accelerate the scale-up of CM, eliminating the costs of scaffolds. In the context of CM, relevant types of self-organization are the alignment of fibers (especially myofibrils and collagen cables), the production of functional vasculature networks, maintaining ratios of meat-relevant cell types (myocytes, adipocytes, fibroblasts), and the structural determinants of texture and mouthfeel (Nishimura, 2010). Attempts have been made to model tissue self-organization using differential adhesion (Cerchiari et al., 2015), cellular Potts models (Libby et al., 2019), and agent-based modeling (Wang et al., 2020). Accurate models of tissue self-organization may provide design principles for making tissues with defined structure in CM.

A family of methods that may help with structured tissue construction in CM is spatial transcriptomics, which correlates gene expression in situ to physical coordinates within a tissue section (Tian et al., 2023). Spatial transcriptomics has been used for understanding specific aspects of tissue structure and cell organization that rely on spatial context, such as extracellular forces and gradients of signaling molecules (Heumos et al., 2023). GNNs have been successfully used with spatial transcriptomics data to model cellular communication (Fischer et al., 2023; Hu et al., 2021; Tanevski et al., 2022) and the deep learning model Tangram has also been used to resolve cell types and decrease imputation error from spatial data (Biancalani et al., 2021). Although spatial transcriptomics has not yet been used in the context of CM, it could prove useful in dissecting the complex tissue architectures involved in structured products.

Structured products will also require bioreactor systems capable of perfusion and harvesting of large intact tissues. The nature of a structured product bioreactor bears similarity to the fluidized and/or packed bed bioreactors that are used industrially in the wastewater and mineral extraction industries. Numerical fluid dynamical models are classically used to predict the behavior of these systems, but ML has, in recent years, been used for the more unpredictable aspects of these systems (Koerich et al., 2018; Ouyang et al., 2018). In particular, gradient boosting has been used both to predict bed expansion of fluidized bed bioreactors (Peng et al., 2022) and mass transfer in packed bed bioreactors (Guo et al., 2023). Similar methods may be useful for the particular challenge of CM structured product bioreactors.

6.3 Real-time sensory prediction and control could be applied with reinforcement learning

Unlike tissue engineering for medical treatments, the sensory properties of CM, such as flavor and texture, are critical to its commercial success. These may be generated during the cell culture from flavor or texture components of cells, media, or scaffolds, or after harvest using food processing or additives. ML is playing an important role in enhancing the analysis of the flavors and textures of other food products through the analysis of diverse data types and these techniques are likely to play a role in CM development as well.

During product development, flavor can be measured in a variety of ways. Earlier ML models used data from gas chromatography–mass spectrometry, which is an analytical chemistry method used to separate and fingerprint substances from complex mixtures (Bi et al., 2020; Zhu et al., 2021). Later, researchers focused on electronic noses that mimic the olfactory capability of humans through different sensors. Since an electronic nose can be used to collect real-time sensor data during production, ML can also be applied in real-time for quality and flavor control (Gonzalez Viejo et al., 2021; Gonzalez Viejo et al., 2020; Tian et al., 2020). However, it should be taken into consideration that many compounds important for the aroma of meat are generated during cooking (Khan et al., 2015). The latest research efforts concentrate on utilizing the molecular structure and physicochemical attributes of flavor compounds to predict flavors, including taste or smell (Bouysset et al., 2020; Wiltschko, 2019; Lee et al., 2022; Tuwani et al., 2019; Wang et al., 2021). These characteristics are quantified into molecular descriptors, numerical representations that encapsulate the properties of the molecules involved. These descriptors then serve as inputs for ML models, which are trained to predict flavor profiles and odor characteristics with greater objectivity.

Since most of these data are tabular, a wide range of traditional ML approaches, such as support vector machines, random forests, k-nearest neighbor, and AdaBoost Tree, have been employed (Lee et al., 2022; Wang et al., 2021; Ji et al., 2023). Additionally, deep learning approaches, such as CNNs (Bi et al., 2020) and multilayer perceptrons (Zhu et al., 2021; Gonzalez Viejo et al., 2021; Tian et al., 2020), and unsupervised learning approaches, such as cluster analysis by using principal component analysis (Gonzalez Viejo et al., 2021; Tian et al., 2020), have also been applied to identify or predict flavors. These methodologies have collectively demonstrated that ML can significantly contribute to the enhancement of the sensory properties of a range of food products.

AI approaches, particularly RL, offer significant potential for enhancing efficiency in food processing operations (Petsagkourakis et al., 2020; Aljaafreh, 2017; Bi et al., 2020). Food-processing facilities are typically equipped with a variety of sensors, including those for temperature, pressure, moisture, and pH levels. These sensors play a crucial role in ensuring precise ingredient measurements, which are fundamental for achieving the desired taste and aroma profiles of food products. However, the challenge arises when new products are developed, as determining the optimal mixture of ingredients often involves extensive trial and error. Deploying an RL agent in this context can effectively manage and adjust the various sensor readings, thereby ensuring that the food consistently meets the specific standards and requirements set by the food processor. This approach not only streamlines the development process but also enhances the precision and quality of the final product. RL can also be valuable for sensory prediction, in particular the texture prediction of finished CM products (Kircali Ata et al., 2023).

7 Discussion

In the last decade, the field of CM has made strides toward lower-cost and more efficient production processes but must progress significantly further to effectively rival traditional meat. ML offers great promise in improving every stage of CM production, from cell line development to the final product’s sensory characteristics. With the current surge in both public and private research and development for both the ML and CM fields, and the already successful integration of ML into numerous life sciences fields, the integration of ML into CM research is timely. Despite the numerous opportunities, there are only a handful of peer-reviewed, publicly accessible studies that describe the use of ML in CM production (Table 1), and an equally small group of researchers versed in both ML and CM. This review aims to bridge the gap between the ML and CM fields and create a starting point for scientists to better understand how to apply ML to CM research. To our knowledge, this is the first review to provide a comprehensive overview of the applications of ML to CM, covering the topics of cells, media, microscopy, bioprocess, and final product properties.

Since ML has been successfully employed in many other bioinformatics sectors, many of the existing methods can be adopted to CM. However, the key limitation is the availability of sufficient data. Creating ML models requires large training and validation datasets, and ML models are only as robust and reliable as the amount and quality of data used to develop them (Priestley et al., 2023). The scarcity of public data in CM complicates the development of models or even the assessment of potential model types. There are some publicly available dataset repositories, such as the Gene Expression Omnibus (GEO) for RNA-seq data,¹ GenBank for sequenced genome data,² and Uniprot for protein sequences.³ However, compared to species used in medical studies, few CM-relevant datasets are reported and many lack adequate descriptions, data annotations, or samples and replicates to be considered for statistical analysis. As an example, a query for “stem cell” in the GEO DataSets generates 71,327 datasets for Homo sapiens (human) and only 61 for Bos taurus (cattle) (as of February 1, 2024). Efforts to generate properly annotated data and incentives for the sharing of data from ongoing experiments would greatly accelerate the application of ML to CM research. The Cultivated Meat Modeling Consortium⁴ offers a model for community-generated and shared data, while protecting intellectual property, to accelerate computational models.

A survey of datasets with potential relevance to CM research is included in Supplementary Table S1. The dataset survey encompasses a compilation of open-access biological datasets derived from CM-relevant species such as fish, crustacean, mollusks, cow, pig, and chicken. These curated datasets span a diverse array of sources including sequencing (RNA, ATAC, ChIP, single cell, and genome), mass spectrometry (proteomics, lipidomics, and metabolomics), and microarray experiments.

Transfer learning might be able to partially make up for the scarcity of CM-relevant data, by using models from data-rich biomedical species, such as humans and mice, to inform models for data-poor CM-relevant species. Cross-species graph-based transfer learning has previously been applied in a non-CM context on RNA-seq data for cell-type identification (Liu et al., 2023; Wang et al., 2024). Challenges to this approach include biological heterogeneity from differing sets of genes or differing functions for genes across species, which could potentially be mitigated through techniques such as universal cell embeddings (Park et al., 2024; Rosen et al., 2024). As a starting point, researchers may look into fine-tuning large scale models that have been successful in achieving improved performance on biological tasks relevant to CM with limited task-specific data, including gene network analysis and cell type annotation, such as Geneformer and scGPT (Cui et al., 2024; Theodoris et al., 2023).

Another challenge relates to the scale of existing studies. Most ML research related to CM has been limited to laboratory environments, which might not mirror the conditions of mass production. Solutions could include either verifying these lab models at a commercial level or using process simulations to adapt them for larger operations.

From the perspective of biologists, an important way to speed up ML work is to produce the biological models that are used to generate data for training ML models. For example, the production and dissemination of more high-quality cell lines from agriculturally relevant animals is needed to aid the generation of omics and microscopy data. Furthermore, biological scientists can make efforts to contribute to the body of existing data by publishing any quality data that results from their experiments, whether or not they are directly used in their own studies, including negative results. Publication of transcriptomic and epigenomic data has become more commonplace, however complete microscopy image sets, in particular, are rarely published. There is also a need for more proteomic and metabolomic data, especially in understudied food-relevant species, to address numerous use cases including metabolic modeling, flavor profiling, and bioreactor scaling concerns (Nissa et al., 2022). Moreover, datasets should be properly annotated - this first includes metadata describing the samples the data came from, how the data was generated, and any processing that was done to the data. Ideally, data would include all raw data, which ML scientists could use to regenerate the original dataset. Similarly, properly labeled data (labeling of individual data points) is important for supervised learning models, which remain the most popular and widely used (Larrañaga et al., 2006). Finally, the quality and consistency of datasets are also critical for their use in training ML models. Variations in sample handling and data collection can lead to large and varied systematic errors that make it challenging to usefully combine multiple datasets or for a model trained on one dataset to apply to another (Priestley et al., 2023).

Efforts to make ML more accessible to researchers, such as infrastructure, frameworks, benchmarks, and libraries, could help facilitate the application of ML to CM. Currently, a variety of ML models are freely available from online resources like Paperswithcode⁵ and Hugging Face.⁶ Additionally, libraries such as Python’s scikit-learn (Grisel et al., 2024) or frameworks such as PyTorch Lightning⁷ offer a good starting point for coding ML. Similarly, accessible web interfaces could make ML tools more accessible to biologists, following the examples of AlphaFold and Foldseek, which both have interfaces integrated into the widely used online protein database UniProt.

Overall, there is an enormous opportunity for CM researchers to incorporate ML techniques and for ML professionals to explore the CM field. The ML field has been moving at unprecedented speed over the past few years, and CM researchers could make gains simply by porting over what is, in effect, yesterday’s news in ML. This review aims to be an introductory resource for researchers eager to explore this cross-disciplinary method, which could help to establish CM as a viable and sustainable protein source in our diets.

Author contributions

MT: Conceptualization, Writing – original draft, Writing – review & editing. SJ: Conceptualization, Visualization, Writing – original draft, Writing – review & editing. RV: Conceptualization, Writing – original draft, Writing – review & editing. RS: Visualization, Writing – review & editing. KS: Visualization, Writing – review & editing. BD: Conceptualization, Project administration, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. BD, MT, JS, and RV were funded by a grant from Schmidt Futures (SF ID 2022-10-17-8).

Acknowledgments

The authors would like to thank Zachary Consenza, Roy Nadler, Amin Nikkhah, Dave Staszak, Charlie Taylor, Sofia Giampaoli, Animesh Acharjee, Ana Velez Rueda, Jessica Carballido, and Leandro Sommese for reviewing early drafts of this manuscript and providing valuable insights. We thank Isha Datar and Cam Linke for their assistance in finding funding for the project. We would also like to thank Evan Rapoport for his facilitative leadership on this project.

Conflict of interest

Author MT was employed by company Todhunter Scientifics.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2024.1424012/full#supplementary-material

Abbreviations

AI, artificial intelligence; CHO, Chinese hamster ovary; CM, cultured meat; CNN, convolutional neural network; FAO, Food and Agriculture Organization; GAN, generative adversarial network; GNI, gene network inference; GNN, graph neural network; ML, machine learning; NLP, natural language processing; OECD, Organization for Economic Co-operation; RL, reinforcement learning; RNN, recurrent neural network; VAE, Variational Autoencoder.

Footnotes

1. ^https://www.ncbi.nlm.nih.gov/geo/

2. ^https://www.ncbi.nlm.nih.gov/genbank/

3. ^https://www.uniprot.org/

4. ^https://thecmmc.org/

5. ^https://paperswithcode.com/

6. ^https://huggingface.co/

7. ^https://lightning.ai/

References

Akiyama, M., and Sakakibara, Y. (2022). Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics Bioinform 4:lqac012. doi: 10.1093/nargab/lqac012

PubMed Abstract | Crossref Full Text | Google Scholar

Alharbi, F., and Vakanski, A. (2023). Machine learning methods for cancer classification using gene expression data: a review. Bioengineering 10:173. doi: 10.3390/bioengineering10020173

PubMed Abstract | Crossref Full Text | Google Scholar

Aljaafreh, A. (2017). Agitation and mixing processes automation using current sensing and reinforcement learning. J. Food Eng. 203, 53–57. doi: 10.1016/j.jfoodeng.2017.02.001