
ORIGINAL RESEARCH article
Front. Genet. , 03 March 2025
Sec. Statistical Genetics and Methodology
Volume 16 - 2025 | https://doi.org/10.3389/fgene.2025.1504443
This article is part of the Research Topic Statistical Approaches, Applications, and Software for Longitudinal Microbiome Data Analysis and Microbiome Multi-Omics Data Integration
There is rising interest in using longitudinal microbiome data to understand how the past status of the microbiome impacts the current state of the host, referred to as “time-lagged” effects, as these effects may take time to occur. While existing works used previous states of the microbiome in their analysis, they did not use methods that identify both the time-lagged associations and their corresponding time lags. In this article, we present a framework to identify time-lagged associations between abundances of longitudinally sampled microbiota and a stationary response (final health outcome, disease status, etc.). We start with a definition of the time-lagged effect by imposing a particular structure on the association pattern of longitudinal microbial measurements. Using group penalization methods, we identify these time-lagged associations including their strengths, signs, and timespans. Through simulation studies, we demonstrate accurate identification of time lags and estimation of signal strengths by our approach. We further apply our approach to find specific gut microbial taxa and their time-lagged effects on increased parasite worm burden in zebrafish.
The study of the gut microbiome is critical for understanding health and disease (Gomaa, 2020; Afzaal et al., 2022). The gut microbiome is not a static system; it has complex dynamics and its composition shifts constantly (Gerber, 2014). Changes to the gut microbiome can occur due to diet, medical interventions (e.g., antibiotic usage), and health status, among other drivers. Sampling the microbiome longitudinally (at several points in time) uncovers potential temporal variations in the microbiome, which can provide a fuller understanding of the ecosystem (Grieneisen et al., 2023). The dynamic aspect of the microbiome/host relationship is understudied, and there is a need for analytical approaches that can handle this kind of complex data.
While many studies have investigated the link between the gut microbiome and health outcomes using a static snapshot of the gut microbiome and host health status, there is a growing recognition of the importance of incorporating longitudinal data. By examining the dynamic structure of the microbiome and its associations over time, researchers can discern patterns that may not be apparent in single-time point analyses or that may take time to develop. A prior state of the gut microbiome may be as or more informative of the current health of a patient than the current state of the gut microbiome. These connections are not necessarily immediate and may take time to occur. We call the association of a previous state with the current state a time-lagged association or a time-lagged effect. One famous instance of the time-lagged microbiota-host association is the long-term health and disease outcomes that are associated with the infant microbiome (Sarkar et al., 2021).
Identifying time-lagged associations/effects helps uncover the unique dynamics of the microbiome. Biological responses to changes in the microbiome do not necessarily appear immediately. Instead, these responses may manifest after some lag, creating a time-lagged association. A biological response associated with some disruption or change in the microbiome may not be observable until weeks later. By pinpointing the timing of these lagged associations, we gain insight into when interventions could be introduced for maximum effect. This approach allows for better-informed, time-specific strategies in microbiome research, leading to more effective interventions and a deeper understanding of host-microbiome interactions.
Multiple studies have been conducted to identify time-lagged microbiota-host associations. Wilmanski et al. (2021) linked low gut microbiome uniqueness and high relative Bacteroides abundance to decreased 4-year survival in older healthy adults using Cox proportional hazard regression models. In a study to identify associations between the gut microbiome and nestling weight and survival in wild great tits, Davidson et al. (2021) identified specific microbial ASVs associated with surviving to fledgling using data from day 8 post-hatch, as well as specific ASVs associated with non-survival. Luna et al. (2020) developed a joint modeling framework to detect associations between longitudinal microbiome count data and time-to-event outcomes. They applied this method to analyze longitudinal samples of pregnant women and found that a 10% increase of the genus Prevotella was associated with a 1.5-fold increase in hazard of delivery. These studies show the wide interest and great potential in using longitudinal microbiome data to understand how the past status of the microbiome impacts the current state of the subject. However, although these studies included previous states of the microbiome in their analysis, none of them used methods that identify both the time-lagged associations and their corresponding time lags. This gap showcases the need for more tailored methods.
In this work, we introduce a novel framework that identifies the lags and associations of specific taxa with a response, utilizing penalized group selection methods. The group selection methods identify taxa as well as associated time points to form their specific lagged associations. We apply this framework to real data of longitudinally sampled zebrafish gut microbiome and host parasite infection (Hammer et al., 2024). These data were originally collected to investigate the links between the microbiome, parasitic infection, and intestinal metabolites. Hammer et al. (2024) found the amount of microbiome disruption in parasite-infected zebrafish was correlated with parasite infection severity. Our analysis further found genera that were also identified as microbial mediators for the metabolome by Hammer et al. (2024) as well as additional genera worthy of further exploration to understand the gut microbiome-parasite burden link.
In this section, we formally define a time-lagged effect and its corresponding time lag. For each of the
where
In this work, we focus on identifying and estimating time-lagged associations between the covariates and the response. We start with the definition of the time-lagged effect of a covariate on the response. To facilitate the definition, we illustrate the underlying relationship between two covariates and the response over time in Figure 1. In Figure 1, we lay out the repeated measures of two covariates
In Figure 1, on the one hand, we say that
On the other hand,
Based on the above discussion, we formally define the time-lagged effect of a covariate
In other words, if
The time-lagged effect of microbiome on host status is not uncommon in practice. For example, our real data analysis identifies several microbial taxa in the gut that have a variety of lagged associations with zebrafish parasite worm burden. In particular, abundances of genus Chitinibacter were found to have a 29-day-lagged association with parasite worm burden, whereas abundances of genus Mycobacterium were found to have an instantaneous association with parasite worm burden. More details of such findings can be found in the real data section.
Due to the correspondence between the time-lagged effect and the sparsity pattern of the grouped coefficients in (2), identifying and estimating time-lagged effects can be regarded as a grouped variable selection problem. To identify these effects, we aim to pinpoint which time points have non-zero coefficient values, indicating a true association with the response. This process parallels traditional variable selection, where the objective is to determine which covariates contribute to the model. By considering the measurements of a covariate at different time points as a group, we treat the identification of time-lagged effects as a grouped variable selection problem, often solved using group penalization approaches, ensuring that only the relevant time points are retained in the model.
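As a minimal sketch of this grouping, the repeated measurements of each covariate can be flattened into adjacent columns that share one group label. The array layout, the dimensions, and the helper name `build_grouped_design` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def build_grouped_design(X):
    """Flatten an (n_subjects, n_covariates, n_timepoints) array into a
    2-D design matrix ordered covariate by covariate, and return a group
    label (the covariate index) for every column."""
    n, p, T = X.shape
    design = X.reshape(n, p * T)         # column j*T + t holds covariate j at time t
    groups = np.repeat(np.arange(p), T)  # columns of the same covariate share a label
    return design, groups

# Hypothetical toy dimensions: 5 subjects, 4 covariates, 3 time points.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4, 3))
design, groups = build_grouped_design(X)
print(design.shape)  # (5, 12)
print(groups)        # [0 0 0 1 1 1 2 2 2 3 3 3]
```

Any grouped selection routine that accepts a design matrix and a vector of group labels could then be applied to `design` and `groups`.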
In the group penalization framework, we estimate the parameters
where
In this subsection, we briefly review existing grouped variable selection methods as well as their corresponding penalty functions
Group-level selection methods select groups of variables and bi-level selection methods select both groups of variables and individual variables within a group, to represent their associations with a response. A well-known representative of group-level selection methods is group lasso (Yuan and Lin, 2006), while bi-level selection methods include group bridge (GrBridge) (Breheny and Huang, 2009), group exponential lasso (GEL) (Breheny, 2015), and composite MCP (cMCP) (Huang et al., 2012).
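To illustrate why a group-level method keeps or drops a group as a whole, the sketch below implements group-wise soft-thresholding, the proximal operator of the Euclidean norm that underlies many group lasso solvers. It is a generic illustration under that assumption, not the fitting routine used in this article; the function name and the threshold values are hypothetical.

```python
import numpy as np

def group_soft_threshold(beta_group, threshold):
    """Shrink a whole coefficient group toward zero; return exact zeros
    when the group's Euclidean norm does not exceed the threshold."""
    beta_group = np.asarray(beta_group, dtype=float)
    norm = np.linalg.norm(beta_group)
    if norm <= threshold:
        return np.zeros_like(beta_group)
    return (1.0 - threshold / norm) * beta_group

# A weak group is removed entirely; a strong group is kept and shrunk,
# with every one of its coefficients remaining nonzero.
weak = group_soft_threshold([0.3, -0.2, 0.1], threshold=0.5)
strong = group_soft_threshold([2.0, 1.0, 2.0], threshold=1.0)
print(weak)
print(strong)
```

Bi-level methods differ in that their penalties can additionally set individual coefficients within a retained group to zero.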
The first half of Table 1 shows the penalty functions
where
Applying group-level selection methods to the groups of coefficients
Nonetheless, to report the performance of group-level/bi-level selection methods in lag identification, we still define the time lag for these methods using the largest index of nonzero estimated coefficients in a selected group. For example, for a variable
Figure 2. Possible sparsity patterns and their corresponding lags for each type of grouped variable selection methods applied to a toy example with one single variable
Compared to the group-level/bi-level selection methods that are often applied to non-overlapping groups, overlapping group selection methods (Obozinski et al., 2011) impose group-level selection methods to overlapping groups. Interestingly, applying overlapping group selection methods appropriately yields the sparsity pattern that defines the time-lagged effect in (2), as illustrated below.
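The property can be checked with a small sketch. It assumes that the overlapping groups of one covariate are nested, so that each group contains every time point up to a given one, and that the support of the fitted model is the union of the selected groups, as in the latent formulation of overlapping group lasso. Both the nesting and the helper names are assumptions made for illustration.

```python
from itertools import combinations

def nested_groups(n_timepoints):
    """Overlapping groups for one covariate: group k holds time points 0..k."""
    return [list(range(k + 1)) for k in range(n_timepoints)]

def support_of(selected_groups):
    """Support implied by a set of selected groups (their union)."""
    return sorted(set().union(*selected_groups)) if selected_groups else []

groups = nested_groups(3)
print(groups)  # [[0], [0, 1], [0, 1, 2]]

# Whatever subset of groups is selected, the support is an unbroken run
# of time points starting at the first one (or empty).
for r in range(len(groups) + 1):
    for chosen in combinations(groups, r):
        support = support_of(list(chosen))
        assert support == list(range(len(support)))
```

Under these assumptions, no selection of groups can produce a gap in the middle of a covariate's time series, which is what distinguishes this construction from the bi-level methods.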
Instead of constructing one group of coefficients for a covariate
The second half of Table 1 shows the penalty functions
By the nature of group-level selection methods, all variables in a group are kept in the model if the group is selected. Therefore, for each selected group, its estimated coefficients satisfy the sparsity pattern in (2) due to the special construction of these groups. The final model for
To mimic the real data in our simulation, we make use of the microbiome data from a longitudinal study sampling the zebrafish microbiome (Hammer et al., 2024). The fecal microbiome of zebrafish was analyzed across three separate days
We simulate the response with two distributional settings, the normal distribution and the Poisson distribution. With the normal distribution, we simulate the response using a linear model:
For longitudinal data sampled at three time points, there are four possible lags, lag 0, 1, 2, and 3, where lag 3 corresponds to no association. Of the 38 taxa, we set a sparse signal of six true associations, two instances of each of lag 0, 1, and 2. We randomly assign the six true associations among the 38 taxa. Additionally, we test three magnitudes of the signal of the true associations, small, medium, and large. For the linear case these are
The original sample size of the zebrafish study
We compare seven grouped variable selection methods in two categories: group-level/bi-level selection methods and overlapping group selection methods. For the former, we applied group lasso (GrLasso), group exponential lasso (GEL), composite MCP (cMCP), and group bridge (GrBridge) to the simulated data. For the latter, we applied overlapping group lasso (O-GrLasso), overlapping group MCP (O-GrMCP), and overlapping group SCAD (O-GrSCAD).
We compare the performance of the seven grouped variable selection methods with various simulation settings: sample size
Since our goal is to identify the groups of variables associated with the response, we first examine the group true positive rate (TPR) and false positive rate (FPR). We define group rates based on whether at least one variable in the group is present in the model (regardless of whether it has the correct lag or not). Recall that we have six relevant groups with a signal (two repeats of each lag 0, 1, and 2); the remaining 32 groups are irrelevant as they have no signal. The group TPR and FPR are calculated based on the relevant and irrelevant groups, respectively.
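The group-level rates defined above can be computed as in the following sketch. The coefficient matrix layout (one row per group, one column per time point), the function name, and the example selection are hypothetical.

```python
import numpy as np

def group_rates(beta_hat, relevant):
    """Group TPR and FPR. A group counts as selected when at least one of
    its estimated coefficients is nonzero, regardless of the lag."""
    selected = np.any(np.asarray(beta_hat) != 0, axis=1)
    relevant = np.asarray(relevant, dtype=bool)
    tpr = selected[relevant].mean()
    fpr = selected[~relevant].mean()
    return tpr, fpr

# Hypothetical fit: 38 groups, the first six carry a signal; the method
# selects three relevant groups and four irrelevant ones.
relevant = np.arange(38) < 6
beta_hat = np.zeros((38, 3))
beta_hat[[0, 2, 5], 0] = 1.0
beta_hat[[10, 11, 20, 37], 2] = 0.5
tpr, fpr = group_rates(beta_hat, relevant)
print(tpr, fpr)  # 0.5 0.125
```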
Figure 3 shows the group TPR (left) and FPR (right) across simulation settings. Comparing the normal and Poisson settings, both group TPR and group FPR are better in the normal setting than in the Poisson setting. GrBridge has one of the best performances in the normal setting, having the lowest group FPR in all cases and comparable group TPR, but performs the worst in the Poisson setting, having very low group TPR even as the sample size increases.
Figure 3. Group TPR and FPR for each method across three sample sizes (rows) and signal magnitudes (columns). Plot (A) shows results for the normal setting and plot (B) shows results for the Poisson setting. Solid line box plots represent the overlapping group selection methods; dotted line box plots represent the group-level/bi-level selection methods.
Comparing the sample sizes, group TPR increases as the sample size increases in all cases. Group FPR also generally decreases with an increased sample size, but there are some cases of a higher group FPR when
Comparing the signal magnitudes, group TPR remains lower with a smaller signal magnitude across all sample sizes and for both the normal and Poisson cases. Group TPR increases as the signal magnitude increases; however, we do not see the same trend for group FPR. The group FPRs in the normal setting are roughly the same as the signal magnitude changes. In the Poisson case, we see a somewhat higher group FPR with a larger signal magnitude, but this can possibly be attributed to the increased dispersion in the simulated responses.
Comparing the grouped variable selection methods, we generally see that the overlapping group and bi-level selection methods perform similarly to each other. The exception to this is GrLasso and O-GrLasso, which perform similarly, and worse than the other methods in the normal setting for a larger sample size and signal magnitude.
In the simulation setting that is the closest to the real data (the Poisson model with sample size of 21 and large signal magnitude), the group TPR is between 35% and 50% for a few methods such as O-GrLasso, GrLasso, GEL, and cMCP (see the right upper corner of the left panel in Figure 3B). In addition, in the same simulation setting, the group FPR is between 15% and 25% for the above methods (see the right upper corner of the right panel in Figure 3B). In other words, these few methods result in group TPR that are well above zero, and they control group FPR quite well. Such results suggest that these methods can identify potentially true signals from the real data analysis, although they may not be able to reveal all true signals for a data set of this size.
As we are also interested in seeing how well the methods perform in identifying the variables that should be present in the model, we additionally examine the variable TPR and FPR. Recall that in our simulation, we have 12 relevant variables and 102 irrelevant variables. The variable TPR and FPR are calculated based on the relevant and irrelevant variables, respectively.
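The counts of relevant and irrelevant variables can be reproduced with a short sketch. It assumes that a covariate with lag l has nonzero coefficients at its first T − l time points, which is consistent with the counts quoted above, and it places the six true associations in the first six rows purely for convenience (the simulation assigns them at random). The function names are hypothetical.

```python
import numpy as np

def true_pattern(lags, n_timepoints):
    """Row j is True at the time points where covariate j has a nonzero
    coefficient, i.e. the first n_timepoints - lag time points."""
    lags = np.asarray(lags)
    return np.arange(n_timepoints)[None, :] < (n_timepoints - lags)[:, None]

def variable_rates(beta_hat, truth):
    """Variable-level TPR and FPR of a fitted coefficient matrix."""
    selected = np.asarray(beta_hat) != 0
    return selected[truth].mean(), selected[~truth].mean()

# Two taxa each with lag 0, 1, and 2; the other 32 taxa have lag 3 (no association).
lags = [0, 0, 1, 1, 2, 2] + [3] * 32
truth = true_pattern(lags, n_timepoints=3)
print(truth.sum(), (~truth).sum())  # 12 102
```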
Figure 4 presents the variable TPR (left) and FPR (right) across our simulation settings. As the sample size increases, we see generally an increasing variable TPR and the rate becomes low in all cases except the small-signal case. GrBridge notably remains unlikely to pick up any of the true signals in the small-signal case, although it performs among the best in other signal cases for the normal setting.
Figure 4. Variable TPR and FPR for each method across three sample sizes (rows) and signal magnitudes (columns). Plot (A) shows results for the normal setting and plot (B) shows results for the Poisson setting. Solid line box plots represent the overlapping group selection methods; dotted line box plots represent the group-level/bi-level selection methods.
Similarly, the variable FPR decreases when the sample size increases. In all settings, we see an average variable FPR below 0.25. GrLasso retains a higher FPR as the sample size increases, unsurprisingly, as it by definition always includes all variables of a chosen group, which is an overestimation if any non-instantaneous group is selected. We also see a slight increase in variable FPR when the sample size increases from 21 to 50, likely due to the increased allowable model size allowing more irrelevant variables to be included.
Comparing the grouped variable selection methods, we see a lower variable TPR from the bi-level methods in almost all cases. This is unsurprising as bi-level methods have the possibility of excluding relevant variables before the lag time as they do not necessarily maintain the correct sparsity pattern as in (2). Except GrLasso, the overlapping group selection methods and the bi-level selection methods yield comparable variable FPR and there is no obvious winner.
Figure 5 shows the proportion of the 100 replicates in which the lag is correctly estimated for each true lag (0–3). As expected, GrLasso performs the worst for intermediate lags of 1 or 2, as it can only ever identify all of the time points (lag 0) or none of them (lag 3) in a group.
Figure 5. Proportion of correct lag-identification across simulation replicates for each of the four true lags. Plot (A) shows results for the normal setting and plot (B) shows results for the Poisson setting. Solid lines represent the overlapping group selection methods; dotted lines represent the group-level/bi-level selection methods.
Lags are generally more correctly identified as the sample size increases, although the proportion of correct lag-identification remains low in the setting with the smallest signal magnitude. Except in the setting with the smallest sample size and the smallest signal magnitude, all methods generally identify the correct lag over half of the time in the normal setting. The normal setting yields a higher proportion of correct lag-identification than the Poisson setting.
In general, lag identification also improves as the signal magnitude increases, although we see a few methods performing worse for true lag 1 as the signal magnitude increases. In the Poisson setting, an increased signal magnitude occasionally leads to a decreased proportion of correct lag-identification for true lags of 1 and 3.
Comparing the grouped variable selection methods, we do not see any clear winner between the overlapping group and bi-level selection methods. While an overlapping group selection method always identifies the sparsity pattern correctly whenever it identifies the lag correctly, it is not necessarily the case for a bi-level selection method. Therefore, we also report the proportion of incorrect identification of the sparsity pattern from the bi-level selection methods. Averaged across all sample-size and signal-magnitude settings, GEL has an incorrect lag pattern 22% of the time, cMCP has an incorrect lag pattern 29% of the time, and GrBridge has an incorrect lag pattern 12% of the time in the normal setting. In the Poisson setting, GEL and cMCP have an incorrect lag pattern 44% of the time, and GrBridge has an incorrect lag pattern 62% of the time. However, as we see from Figure 5, they can still generally identify the time lag correctly. These observations suggest a careful interpretation is needed for the time-lagged association from the bi-level selection methods.
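The distinction between a correct lag and a correct sparsity pattern can be made concrete with a small sketch. It assumes the lag is counted as the number of time points after the last nonzero coefficient, and that a valid pattern is an unbroken run of nonzero coefficients starting at the first time point; the function names and coefficient values are hypothetical.

```python
import numpy as np

def estimate_lag(beta_row):
    """Number of time points after the last nonzero coefficient; an
    all-zero row receives the maximal lag (no association)."""
    nonzero = np.flatnonzero(beta_row)
    n_timepoints = len(beta_row)
    if nonzero.size == 0:
        return n_timepoints
    return n_timepoints - 1 - int(nonzero[-1])

def follows_lag_pattern(beta_row):
    """True when the nonzero coefficients form an unbroken run that
    starts at the first time point (an all-zero row also qualifies)."""
    mask = np.asarray(beta_row) != 0
    return bool(mask[: int(mask.sum())].all())

overlapping_fit = [0.4, 0.3, 0.2]  # lag 0, valid pattern
bilevel_fit = [0.4, 0.0, 0.2]      # lag 0 as well, but with a gap in the middle
print(estimate_lag(overlapping_fit), follows_lag_pattern(overlapping_fit))  # 0 True
print(estimate_lag(bilevel_fit), follows_lag_pattern(bilevel_fit))          # 0 False
```

Both fits report the same lag, yet only the first satisfies the pattern, which is the situation that calls for careful interpretation of bi-level results.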
Based on the simulation results, we have not identified a clear winner from the seven methods in all simulation settings. However, we could still make the following recommendations based on our limited observations. To ensure the sparsity pattern in (2), we recommend using the overlapping group selection methods, including O-GrLasso, O-GrMCP, and O-GrSCAD. In the simulation setting that is closest to the real data (Poisson model with sample size of 21 and large signal magnitude), O-GrLasso outperforms O-GrMCP and O-GrSCAD in terms of group selection and variable selection, suggesting its better performance in detecting the time-lagged effects. Nonetheless, we regard all these methods as a toolbox for identification of time-lagged effects and suggest the use of them in a complementary way.
We apply our group penalization framework to the real dataset from Hammer et al. (2024), which originally studied the role of the gut microbiome in mediating parasitic infection of Pseudocapillaria tomentosa in zebrafish. In our application, we make use of the data collected from longitudinally sampled zebrafish fecal samples (sampled on days 0, 3, and 32) to find time-lagged associations between the abundances of microbial taxa in the zebrafish gut and the parasite burden on zebrafish.
After day 0 and before day 3, half of the tanks of the zebrafish in the study were given an antibiotic and the other half were not. After day 3, in each group of zebrafish (antibiotic and control), roughly half of them were exposed to the parasite P. tomentosa. All zebrafish were sacrificed to assess intestinal histopathology on day 32, and the parasite burden on zebrafish was measured. In other words, microbiome data from the zebrafish gut were collected (a) prior to antibiotic exposure (day 0), (b) just prior to parasite exposure but after antibiotic exposure (day 3), and (c) 29 days post-parasite exposure (day 32). Figure 6 further explains the experimental design.
Figure 6. Schematic of real data experimental design by Hammer et al. (2024). (1) Adult fish were placed in individual tanks, (2b) half of the fish were exposed to antibiotics, (3b) fish were exposed to the zebrafish parasite Pseudocapillaria tomentosa. Only the parasite groups were used in the analysis with parasite worm burden as the response. Fecal microbiome and metabolome samples were collected (2a) prior to antibiotic exposure, (3a) prior to parasite exposure, and (4) 32 days post-antibiotic exposure after which fish worm burden was counted. Sample size represents fish alive throughout the study.
Since we use the final parasite burden as our response
In our analysis, the response
We use these data to find time-lagged associations between final parasite burden of the zebrafish and genus abundances measured on day 0, 3, and 32. Since our samples are split into two groups, those exposed to antibiotics and those not exposed, we also take potential effects of antibiotic exposure into account in our modeling. To this end, we include antibiotic exposure as a main effect (
where
We do not penalize the main effects
The candidate groups correspond to different patterns of association with different lags. The first, the empty set, corresponds to no association; the next three correspond to main effects with lags of 32, 29, and 0 days; and the last two correspond to main and interaction effects with lags of 29 and 0 days. Note that the above construction of groups enforces the main effect to be included in the model before the corresponding interaction effects, whereas the bi-level selection methods have no such restriction.
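The six candidate groups and their lags in days can be enumerated with the sketch below. The exact membership of each group is an assumption based on the verbal description: main-effect groups are taken to be nested over days 0, 3, and 32, and each interaction group is taken to contain the main effects up to the same day. The names `build_groups` and `lag_in_days` are hypothetical.

```python
DAYS = [0, 3, 32]           # sampling days in the zebrafish study
INTERACTION_DAYS = [3, 32]  # no antibiotic interaction is possible on day 0

def build_groups():
    """Candidate groups for one genus as lists of (effect, day) pairs."""
    groups = [[]]  # the empty set: no association
    for k in range(1, len(DAYS) + 1):
        groups.append([("main", d) for d in DAYS[:k]])
    for last in INTERACTION_DAYS:
        mains = [("main", d) for d in DAYS if d <= last]
        interactions = [("interaction", d) for d in INTERACTION_DAYS if d <= last]
        groups.append(mains + interactions)
    return groups

def lag_in_days(group):
    """Days between the last time point in the group and the final sampling day."""
    if not group:
        return None  # no association
    return DAYS[-1] - max(day for _, day in group)

for group in build_groups():
    print(lag_in_days(group), group)
```

Under these assumptions the enumerated lags are none, 32, 29, and 0 days for the main-effect groups, followed by 29 and 0 days for the groups that add interaction effects.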
We tabulate in Table 2 the list of genera that are identified to be associated with parasite burden, together with the lags of these associations. For main effects, there are three possible lags: 0, 29, and 32 days. For interaction effects, there are two possible lags, 0 and 29 days, as there is no interaction between the antibiotic treatment and the microbial abundance on day 0. A lag of 0 days indicates an instantaneous effect, whereas a lag of 29 or 32 days implies a delay before the microbial abundance affects the parasite burden, measured from after or before antibiotic exposure, respectively. From the results in Table 2, we can draw the following conclusions.
First, the results highlight the importance of identifying time-lagged associations. Five of the eleven identified genera have a time-lagged effect, which would not have been found if we were not using the proposed approaches.
Second, the overlapping group selection methods perform more similarly to each other, with O-GrLasso and O-GrSCAD producing the same results, while the bi-level selection methods do not have many similarities amongst themselves. GrBridge did not identify any associated taxa. The most commonly identified genus was Candidatus Accumulibacter, which was identified by every method except O-GrMCP and GrBridge.
Third, we see a variety of identified lags. Candidatus Accumulibacter was identified with an instantaneous effect (lag of 0 days) by O-GrLasso/O-GrSCAD and GEL. Chitinibacter was found to have a lag of 29 days, including a 29-day lagged antibiotic interaction effect, by all overlapping methods. Hyphomicrobium had a lag of 29 days by O-GrLasso and O-GrSCAD, and a lag of 32 days by O-GrMCP and cMCP.
Table 3 further presents the estimated coefficients for the identified genera in each model. Table 3 demonstrates a key difference between the overlapping group selection methods and the bi-level selection methods, namely, whether or not the sparsity pattern in (2) is met. For example, Candidatus Accumulibacter has a lag of 0 days for O-GrLasso and O-GrSCAD, as well as GEL, but GEL does not find an association on day 3, violating the sparsity pattern in (2). Additionally, Candidatus Accumulibacter is identified by cMCP to have an association with worm burden, but only the interaction effect on day 32, with no effects on earlier days and no main effects.
The original paper (Hammer et al., 2024) including these zebrafish data conducted an analysis on the mediating role played by the gut microbiome on the relationship between gastrointestinal metabolites and parasitic infection outcome. This section highlights how our findings can further inform the longitudinal aspect of the relationship between the gut microbiome and parasite burden. Many of the genera the mediation study comments on as interesting are also found by our approach, and we expand upon these genera below.
Hammer et al. (2024) identified taxa in the Pseudomonas and Mycobacterium genera to be mediators in the relationship of the important Vitamin E metabolite
Our findings also unite previous research that used these data by uncovering temporal changes in microbial association that help explain microbiota connections to parasite infection burden. For instance, results from Hammer et al. (2024) show that salicylaldehyde, which may be partially controlled by Pelomonas, has particularly strong effects on egg larvation and development. Our finding of an association between Pelomonas and infection burden on day 3 is both novel and important, because this is when fish hosts were exposed to parasite eggs. Based on experimental evidence stemming from this earlier work, salicylaldehyde-related inhibition of helminth maturation is expected to be most pronounced at this time point in the study. Thus, this association from O-GrMCP points to time-dependent activity of Pelomonas that could help connect the link between Pelomonas and salicylaldehyde to the in vivo and in vitro results that implicate salicylaldehyde as an anthelmintic agent.
Additionally, results of applying these methods highlight a possible route by which gut microbes might regulate host intestinal structure to limit helminth infection, namely by regulating tight junction integrity. Results from both O-GrMCP and GEL indicate that day-3 abundances of Cetobacterium are inversely related to helminth parasite infection. Prior work using the zebrafish model has shown that taxa within Cetobacterium synthesize vitamin B12, which mechanistically enhances gut barrier tight junction integrity to prevent microbial pathogen infection and improve gut microbiome structure stability (Qi et al., 2023). These findings are also relevant to nematode infection, where it has been shown using other infection models that early parasite exposure results in loss of epithelial barrier integrity as a result of changes in tight-junction related protein expression (Fernández-Blanco et al., 2015). Similar tight-junction regulating activity could underlie our results, which point to early time points of Cetobacterium relative abundance negatively associating with parasite infection burden, potentially as a result of vitamin B12 biosynthesis.
Overall, we identify similar taxa associating with parasite worm burden as in Hammer et al. (2024), but our results provide important nuance to these findings by revealing time-dependent microbial associations with infection burden. For instance, our finding that Pelomonas is inversely associated with parasite infection burden on day 3 could help explain distinct results from Hammer et al. (2024) that revealed a connection between Pelomonas and salicylaldehyde, and between salicylaldehyde and egg larvation. Furthermore, the connection between Cetobacterium and infection burden also uncovers a testable hypothesis regarding the relationship among microbes, metabolites, and parasite infection that was not previously identified in the Hammer et al. (2024) analysis, and points to an additional route by which microbes could regulate helminth parasite infection. Together, clarifying the time during which a microbe might produce potent anthelmintic products or influence host response to infection can yield insights into the activity of microbes across the time range of parasite infection, possibly suggesting new ways the gut microbiome may be harnessed to combat helminth parasite infection.
To validate the microbial taxa that were identified to be associated with parasitic infection, we conduct a validation analysis using the additional measurements of metabolites in the zebrafish study (see Figure 6). We focus our analysis on the two metabolites that were found to be linked to parasite infection and whose effect on infection burden is mediated by members of the gut microbiome (Hammer et al., 2024), namely, salicylaldehyde and
In the validation analysis, the response
We use these data to find time-lagged associations between genus abundances measured on days 0, 3, and 32 and final metabolite levels of the zebrafish measured on day 32. Since our samples are split into four groups, those exposed and not exposed to antibiotics, as well as those exposed and not exposed to parasites, we take potential effects of antibiotic and parasite exposure into account in our modeling. Compared to the model in (3), we include parasite exposure as an additional main effect (
where
Similar to Section 4.2, we use bi-level/group-level selection methods and overlapping group selection methods to estimate the coefficients in (4). We do not penalize the main effects
Table 6. Coefficient estimates of microbial main (M) and interaction (I) effects on salicylaldehyde.
Notably, of the eleven identified genera in the parasite burden analysis, eight were also identified and thus partially validated by the metabolite analysis. This result provides additional evidence for the critical role of the identified genera in the microbiome-metabolome-host relationship. Similar to Hammer et al. (2024), we also found an association between Pelomonas and salicylaldehyde, further supporting their relationship and joint roles in parasite infection burden as discussed in Section 4.4. Additionally, Hammer et al. (2024) identified Pseudomonas and Mycobacterium as mediators in the relationship between
In this paper, we present a novel framework for identifying time-lagged associations between time-varying covariates and a static response, which enables the investigation of dynamics of host-microbiome interactions. Simulation studies demonstrate the efficacy of the framework in accurately identifying time-lagged associations.
Applying our framework to real zebrafish data further validated its utility. We identified eleven microbial taxa associated with zebrafish parasite burden: four associations were instantaneous and seven were lagged. Three of these taxa overlapped with those identified in the original study (two instantaneous and one lagged), reinforcing previous findings and highlighting new insights into time-lagged associations. For example, some associations changed sign depending on the time lag, suggesting that the timing of an intervention is as crucial as selecting the appropriate microbial target. The identified taxa offer insights into potential mechanisms underlying the interplay between the gut microbiome and parasitic infections. This work contributes to a body of research that aims to clarify host-microbiome-parasite dynamics and informs future research toward developing targeted interventions for parasite control.
While this framework offers a practical approach to estimating time-lagged associations, it has a few limitations. First, we assume that if an association is present, it is measurable from the first time point sampled and persists up to the lagged time point. Where this structure does not apply, for example when there is only an instantaneous association and no association at previous time points, the bi-level selection methods offer more flexibility in the temporal structure of which covariates are included. The definition of a lag may need to be revisited depending on the context.
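The structural assumption in the first limitation can be made concrete with a short sketch. Assuming, as described above, that a taxon's nonzero coefficients form a contiguous run starting at the first sampled time point, the lagged time point is the last time point with a nonzero coefficient. The function name `lag_endpoint`, the tolerance `tol`, and the coefficient values in the usage note are hypothetical illustrations and are not taken from the authors' software or results.

```python
from typing import Optional, Sequence


def lag_endpoint(coefs: Sequence[float], times: Sequence[float],
                 tol: float = 1e-8) -> Optional[float]:
    """Return the last sampling time with a nonzero coefficient.

    Assumes (hypothetically) that `coefs` holds one taxon's estimated
    coefficients ordered by sampling time. Raises ValueError when the
    nonzero pattern is not a run starting at the first time point,
    i.e., when the assumed lag structure does not hold.
    """
    if len(coefs) != len(times):
        raise ValueError("coefs and times must have the same length")
    active = [abs(c) > tol for c in coefs]
    if not any(active):
        return None  # taxon not selected at any time point
    last = max(i for i, is_active in enumerate(active) if is_active)
    if not all(active[: last + 1]):
        raise ValueError("support is not a run starting at the first time point")
    return times[last]
```

With sampling days `[0, 3, 32]` and made-up coefficients `[0.4, -0.2, 0.0]`, the function returns `3`. A pattern such as `[0.0, 0.0, 0.5]`, an association at the final time point only, raises an error, which corresponds to the case where the more flexible bi-level selection methods would be preferable.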
Second, the length of the available longitudinal data matters. Our method searches the entire timeframe of the data for lags, because it assumes that all covariates prior to the lag remain relevant. This assumption could be problematic if researchers are working with extended datasets spanning several years. Researchers would need to determine a reasonable timescale for a lag in their application. In some cases, a lagged association of 2 years would be reasonable, but in others only a 2-month lagged association would be reasonable.
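One simple way to act on this, sketched here under assumptions, is to restrict the candidate time points to a window before the response measurement before fitting any model. The function name `candidate_time_points` and the parameter `max_lag` are hypothetical; the choice of window remains a domain decision, as noted above.

```python
from typing import List, Sequence


def candidate_time_points(times: Sequence[float], response_time: float,
                          max_lag: float) -> List[float]:
    """Keep sampling times no later than the response and within max_lag of it."""
    if max_lag < 0:
        raise ValueError("max_lag must be non-negative")
    return [t for t in times if 0 <= response_time - t <= max_lag]
```

For sampling days `[0, 3, 32]`, a response measured on day 32, and an illustrative maximum lag of 30 days, this keeps `[3, 32]` and drops day 0.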
Third, our framework uses different group penalization methods, which may identify different sets of taxa, different lags, or both. Our framework encourages researchers to use the set of taxa identified across these methods, but future work could improve the prioritization of which model-identified taxa to focus on.
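As a rough illustration of how selections from several methods might be compared, the sketch below counts how many methods selected each taxon. This is one simple possibility and is not a prioritization scheme proposed in this paper; the function name `rank_by_agreement` and the method and taxon labels in the usage note are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_by_agreement(selections: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Count how many methods selected each taxon, most-agreed first.

    `selections` maps a method label to the taxa that method selected.
    Ties are broken alphabetically by taxon name.
    """
    counts = Counter(
        taxon for taxa in selections.values() for taxon in set(taxa)
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `rank_by_agreement({"method_1": ["taxon_A", "taxon_B"], "method_2": ["taxon_A"], "method_3": ["taxon_A", "taxon_C", "taxon_B"]})` returns `[("taxon_A", 3), ("taxon_B", 2), ("taxon_C", 1)]`. Agreement counts ignore the estimated lags, so taxa selected with different lags by different methods would still need separate inspection.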
Our application of this framework focused on advancing the understanding of microbial ecology and its influence on host health. However, the framework is applicable to a much broader range of scientific fields, as it can be used whenever there is interest in time-lagged associations between longitudinal data and a static response.
The nucleotide data underlying the findings of this study are available in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA1132310, and annotated metabolomic data from positive and negative ion modes are available at https://github.com/CodingUrsus/Zebrafish_Microbiome_and_Parasites/.
EP: Formal Analysis, Methodology, Software, Visualization, Writing–original draft, Writing–review and editing. AH: Data curation, Validation, Writing–review and editing. TS: Data curation, Funding acquisition, Validation, Writing–review and editing. YJ: Conceptualization, Funding acquisition, Methodology, Supervision, Writing–original draft, Writing–review and editing.
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The authors’ research was supported in part by National Institutes of Health grant R01 GM126549, National Science Foundation grant #2025457, and the College of Science Research and Innovation Seed (SciRIS) Program at Oregon State University.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declare that no Generative AI was used in the creation of this manuscript.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Afzaal, M., Saeed, F., Shah, Y. A., Hussain, M., Rabail, R., Socol, C. T., et al. (2022). Human gut microbiota in health and disease: unveiling the relationship. Front. Microbiol. 13, 999001. doi:10.3389/fmicb.2022.999001
Aitchison, J. (1982). The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B Methodol. 44, 139–160. doi:10.1111/j.2517-6161.1982.tb01195.x
Breheny, P. (2015). The group exponential lasso for bi-level variable selection. Biometrics 71, 731–740. doi:10.1111/biom.12300
Breheny, P., and Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics Its Interface 2, 369–380. doi:10.4310/SII.2009.v2.n3.a10
Breheny, P., and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics Comput. 25, 173–187. doi:10.1007/s11222-013-9424-2
Davidson, G. L., Somers, S. E., Wiley, N., Johnson, C. N., Reichert, M. S., Ross, R. P., et al. (2021). A time-lagged association between the gut microbiome, nestling weight and nestling survival in wild great tits. J. Animal Ecol. 90, 989–1003. doi:10.1111/1365-2656.13428
Fernández-Blanco, J. A., Estévez, J., Shea-Donohue, T., Martínez, V., and Vergara, P. (2015). Changes in epithelial barrier function in response to parasitic infection: implications for IBD pathogenesis. J. Crohn’s Colitis 9, 463–476. doi:10.1093/ecco-jcc/jjv056
Gerber, G. K. (2014). The dynamic microbiome. FEBS Lett. 588, 4131–4139. doi:10.1016/j.febslet.2014.02.037
Gomaa, E. Z. (2020). Human gut microbiota/microbiome in health and diseases: a review. Antonie Leeuwenhoek 113, 2019–2040. doi:10.1007/s10482-020-01474-7
Grieneisen, L., Blekhman, R., and Archie, E. (2023). How longitudinal data can contribute to our understanding of host genetic effects on the gut microbiome. Gut Microbes 15, 2178797. doi:10.1080/19490976.2023.2178797
Hammer, A. J., Gaulke, C. A., Garcia-Jaramillo, M., Leong, C., Morre, J., Sieler Jr, M. J., et al. (2024). Gut microbiota metabolically mediate intestinal helminth infection in zebrafish. mSystems 9, e0054524–24. doi:10.1128/msystems.00545-24
Huang, J., Breheny, P., and Ma, S. (2012). A selective review of group selection in high-dimensional models. Stat. Sci. a Rev. J. Inst. Math. Statistics 27. doi:10.1214/12-STS392
Jacob, L., Obozinski, G., and Vert, J.-P. (2009). “Group lasso with overlap and graph lasso,” in Proceedings of the 26th annual international conference on machine learning (Montreal Quebec Canada: ACM), 433–440. doi:10.1145/1553374.1553431
Luna, P. N., Mansbach, J. M., and Shaw, C. A. (2020). A joint modeling approach for longitudinal microbiome data improves ability to detect microbiome associations with disease. PLOS Comput. Biol. 16, e1008473. doi:10.1371/journal.pcbi.1008473
Obozinski, G., Jacob, L., and Vert, J.-P. (2011). Group lasso with overlaps: the latent group lasso approach. arXiv preprint arXiv:1110.0413
Qi, X., Zhang, Y., Zhang, Y., Luo, F., Song, K., Wang, G., et al. (2023). Vitamin B12 produced by Cetobacterium somerae improves host resistance against pathogen infection through strengthening the interactions within gut microbiota. Microbiome 11, 135. doi:10.1186/s40168-023-01574-2
Sarkar, A., Yoo, J. Y., Valeria Ozorio Dutra, S., Morgan, K. H., and Groer, M. (2021). The association between early-life gut microbiota and long-term health and diseases. J. Clin. Med. 10, 459. doi:10.3390/jcm10030459
Wilmanski, T., Diener, C., Rappaport, N., Patwardhan, S., Wiedrick, J., Lapidus, J., et al. (2021). Gut microbiome pattern reflects healthy ageing and predicts survival in humans. Nat. Metab. 3, 274–286. doi:10.1038/s42255-021-00348-0
Keywords: grouped variable selection, longitudinal microbiome data, parasite worm burden, sparsity pattern, time lag, zebrafish
Citation: Palmer E, Hammer A, Sharpton T and Jiang Y (2025) A group penalization framework for detecting time-lagged microbiota-host associations. Front. Genet. 16:1504443. doi: 10.3389/fgene.2025.1504443
Received: 30 September 2024; Accepted: 05 February 2025;
Published: 03 March 2025.
Edited by: Hongmei Jiang, Northwestern University, United States
Reviewed by: Michael B. Sohn, University of Rochester, United States
Copyright © 2025 Palmer, Hammer, Sharpton and Jiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yuan Jiang, yuan.jiang@oregonstate.edu