DATA REPORT article

Front. Genet.
Sec. Computational Genomics
Volume 16 - 2025 | doi: 10.3389/fgene.2025.1522253

Exploring the Taxonomical and Functional Profiles of Marine Microorganisms in Submarine Groundwater Discharge Vent Water from Mabini, Batangas, Philippines through Metagenome-Assembled Genomes

Provisionally accepted
  • 1 Natural Sciences Research Institute, University of the Philippines Diliman, Quezon City, Philippines
  • 2 Institute of Biology, College of Science, University of the Philippines Diliman, Quezon, National Capital Region, Philippines

The final, formatted version of the article will be published soon.

I. Introduction

Submarine groundwater discharge (SGD) refers to the movement of water from land to coastal waters, flowing across the land-ocean interface (Adyasari et al., 2019). SGD is ubiquitous along sandy, rocky, and muddy shorelines and may include fresh groundwater of terrestrial origin, recirculated seawater, or a combination of both (Adyasari et al., 2019; Santos et al., 2021). The presence of SGD in these areas results in physical and chemical gradients that create unique biogeochemical environments. SGD acts as a conduit for the transport of materials such as gases, nutrients, and trace metals from land to sea (Moore, 2010; Knee and Paytan, 2011). The flux of nitrogen and phosphorus to the ocean from total SGD, which includes both fresh and recirculated seawater, is estimated to exceed riverine inputs on a global scale (Cho et al., 2018). The SGD-mediated inflow of nutrients can significantly impact coastal ecosystems and water quality, altering levels of dissolved and gaseous metabolites, including ammonium, methane, and hydrogen sulfide (Bernard et al., 2014; Santos et al., 2021; Schlüter et al., 2004). This influences microbial communities and their metabolic activities in these specific locations (Purkamo et al., 2022). Similar to terrestrial subsurface environments, deep marine sediments are also characterized by an absence of photosynthetically produced labile organic carbon (Chen et al., 2023). Because of this, groundwater microorganisms have developed diverse strategies to ensure survival and persistence. Among these strategies is the ability to utilize ancient organic carbon from rocks, allochthonous organic carbon, or byproducts from the degradation of organic contaminants (Griebler and Lueders, 2009; Smith et al., 2015).
Other groundwater microorganisms also have adaptations that enable them to fix inorganic carbon by utilizing energy from the oxidation of substrates such as nitrite, ammonium, reduced iron, and sulfur compounds (Ruiz-González et al., 2021). Applying shotgun metagenomics and assembling microbial genomes from metagenomic data can reveal important insights into the structure-function relationships of microbial communities within complex environments (Overholt et al., 2020). Through the use of metagenome-assembled genomes (MAGs), researchers can reconstruct near-complete microbial genomes, enabling precise taxonomic identification and comprehensive functional profiling of these assembled genomes (Mangoma et al., 2024). For instance, Mangoma et al. (2024) successfully recovered taxonomically diverse MAGs from Buhera soda pans in Zimbabwe. These genomes revealed metabolic pathways associated with nitrogen fixation and sulfur cycling, highlighting the ecological roles of microbial communities in such extreme environments. Similarly, a study by He et al. (2016) leveraged metagenomic sequencing of sediment samples from deep-sea hydrothermal vents in the Guaymas Basin, Mexico, to identify Bathyarchaeota MAGs. These genomes were found to contain genes encoding the Wood-Ljungdahl (WL) pathway, an acetogenic process crucial for carbon fixation and acetate production. This finding underscores the pivotal role of Bathyarchaeota in the carbon cycle within deep-sea vent ecosystems. Building on the work of Mangoma et al. (2024) and He et al. (2016), this data report seeks to deepen our understanding of specific environments, such as SGDs, from a microbial perspective. This report expands our knowledge of microbial species thriving in such environments, sheds light on their functional roles, and reveals untapped resources for industrial and medical applications through the analysis of these assembled genomes. 
In this study, high-quality MAGs were generated from shotgun metagenome data derived from water samples from an SGD vent in Mabini, Batangas, Philippines. The presence of SGD at the collection site was documented by Cardenas et al. (2020) using radon (222Rn) concentrations, a natural tracer for SGD (Burnett et al., 2003; Burnett et al., 2006; Cardenas et al., 2020). The generated MAGs were annotated to identify their potential environmental roles and subsequently compared to existing genomes for further insights. This data report offers preliminary findings that can serve as a foundation for future studies and adds to the limited research on the microbial dimension of SGD-associated areas in the Philippines.

II. Methods

Water samples (4 L) were collected from an SGD vent in the Sea Spring site in a coastal area in Mabini, Batangas, Philippines (13.68701 N, 120.89573 E). Divers brought a 4 L Nalgene bottle filled with surface water to the floor of the Sea Spring site, at a depth of about 5 to 6 meters. The surface water was purged from the bottle using pressurized air from the divers’ tanks. After purging, the bottle was tilted toward the vent to collect the SGD water. Using a vacuum pump filtration system, the water samples were filtered through a 3 µm polycarbonate track-etched (PCTE) membrane (Sterlitech Corp., USA). After filtration, the membrane filter was placed in a cryovial and stored in a portable cryo tank containing liquid nitrogen before transport to the laboratory. Total DNA was extracted from the membrane filter using the DNeasy PowerSoil® Pro Kit (QIAGEN), following the manufacturer’s protocol with minimal modifications. Specifically, three sample replicates were prepared, each subjected to the lysis step, and then combined in a single column to ensure sufficient DNA concentration.
The concentration of the extracted DNA was determined using the Denovix QFX Fluorometer and dsDNA broad-range kit following the manufacturer’s instructions. The DNA extract was sent to Macrogen, Inc. (South Korea) for shotgun metagenomic sequencing at 25 Gb throughput with a 150 bp paired-end setting using the Illumina NovaSeq™ 6000 system. De novo metagenome assembly was performed using various bioinformatics tools in the KBase v2.7.11 platform (Arkin et al., 2018). After merging the forward and reverse reads during the importing stage, the resulting merged raw reads were analyzed with FastQC v0.12.1 (Andrews, 2010), which reported 77,717,154 total sequences, 11.7 Gbp total bases, 45% G+C content, and no poor-quality sequences. Subsequently, the merged raw reads were trimmed using Trimmomatic v0.36 (Bolger et al., 2014) with a sliding window size of 4 and a minimum quality of 30. The Q30-trimmed reads were assembled using three assemblers: metaSPAdes v3.15.3 (Nurk et al., 2017; Prjibelski et al., 2020), MEGAHIT v1.2.9 (Li et al., 2016), and IDBA-UD v1.1.3 (Peng et al., 2012), all with default parameters. The metaSPAdes, MEGAHIT, and IDBA-UD assemblies had total lengths of 187,749,451 bp, 178,126,656 bp, and 108,098,378 bp, respectively. The metaSPAdes assembly, which had the longest total sequence length, was selected for downstream processing. The metagenomic contigs from the chosen assembly were grouped into bins using three binning tools: CONCOCT v1.1 (Alneberg et al., 2014), MaxBin2 v2.2.4 (Wu et al., 2016), and MetaBAT2 v1.7 (Kang et al., 2019), following default parameters (minimum contig length of 300 ≤ 2000). The bins generated by these three tools were consolidated and optimized using DAS Tool v1.1.2 (Sieber et al., 2018) with default parameters (diamond, 0.5 score threshold, 0.6 duplicate penalty, and 0.5 megabin penalty).
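
The sliding-window quality trimming described above can be illustrated with a minimal Python sketch. It is a simplified stand-in for the idea, not Trimmomatic's actual implementation: the function names (`phred33`, `sliding_window_trim`, `trim_read`) are hypothetical, Phred+33 quality encoding is assumed, and the read is simply cut at the start of the first 4-base window whose mean quality falls below 30.

```python
from statistics import mean


def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_trim(quals, window=4, min_quality=30):
    """Return how many leading bases to keep.

    The read is scanned from the 5' end; it is cut at the start of the first
    window whose mean Phred score is below min_quality. This is a simplified
    illustration, not Trimmomatic's exact cut-point logic.
    """
    if len(quals) < window:
        # Assumption: reads shorter than one window are judged on their
        # overall mean quality (empty reads are dropped).
        return len(quals) if quals and mean(quals) >= min_quality else 0
    for start in range(len(quals) - window + 1):
        if mean(quals[start:start + window]) < min_quality:
            return start
    return len(quals)


def trim_read(sequence, quality_string, window=4, min_quality=30):
    """Trim a read and its quality string to the retained prefix."""
    keep = sliding_window_trim(phred33(quality_string), window, min_quality)
    return sequence[:keep], quality_string[:keep]


if __name__ == "__main__":
    # Hypothetical read: four high-quality bases followed by four poor ones.
    print(trim_read("ACGTACGT", "IIII5555"))
```

In this sketch the window size and quality threshold are exposed as parameters so that the values reported here (4 and 30) can be compared against other settings on the same reads.
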
The bins generated by DAS Tool optimization were quality-filtered using CheckM v1.0.18 (Parks et al., 2015) to retain those with ≥90% completeness and ≤5% contamination, according to the high-quality MIMAG standards (Bowers et al., 2018). The completeness of the genomes was also determined using BUSCO v5.4.6 (Simão et al., 2015). MAG information and taxonomic identities were visualized using the circlize package (Gu et al., 2014) in RStudio (RStudio Team, 2024). The filtered bins were compared to the top 10 closest known genomes using SpeciesTreeBuilder v0.1.4 (Arkin et al., 2018), which analyzes 49 clusters of orthologous groups (COGs) to determine phylogenetic relationships. In this report, tree reconstruction was conducted without an outgroup to maintain focus on the relationships among the GenBank genomes, allowing for a more refined understanding of their phylogenetic proximity. Subsequently, all genomes underwent general annotation using Rapid Annotation using Subsystem Technology (RAST) through the SEED viewer v2.0 platform (Aziz et al., 2014; Overbeek et al., 2013) with the RASTtk annotation scheme. This report specifically highlights genes with critical functions in biogeochemical cycles, focusing on those involved in sulfur metabolism, iron acquisition and metabolism, potassium metabolism, phosphorus metabolism, and nitrogen metabolism. The phylogenetic tree and RAST results were visualized using the ggtree (v1.14.6; Yu et al., 2016) and ggplot2 (v3.5.1; Wickham, 2016) packages in RStudio (RStudio Team, 2024). The bins were also subjected to Distilled and Refined Annotation of Metabolism (DRAM) v0.1.2 (Shaffer et al., 2020) using default settings. To determine the functional Clusters of Orthologous Groups (COGs), eggNOG-mapper v2 (Cantalapiedra et al., 2021) with eggNOG v5.0 (Huerta-Cepas et al., 2018) was used with default settings.
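
The quality filter applied to the DAS Tool bins (≥90% completeness and ≤5% contamination) can be expressed as a short Python sketch. The `BinQuality` record, the function names, and the example values below are hypothetical and are not results from this report; the sketch applies only the two thresholds stated above and assumes that completeness and contamination have already been estimated by a tool such as CheckM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinQuality:
    """Quality estimates for one genome bin (values in percent)."""
    bin_id: str
    completeness: float
    contamination: float


def is_high_quality(bin_quality, min_completeness=90.0, max_contamination=5.0):
    """Apply the two thresholds used in this report to a single bin."""
    return (bin_quality.completeness >= min_completeness
            and bin_quality.contamination <= max_contamination)


def filter_high_quality(bins):
    """Keep only bins passing both thresholds, preserving input order."""
    return [b for b in bins if is_high_quality(b)]


# Hypothetical example values, for illustration only.
example_bins = [
    BinQuality("bin_A", completeness=98.2, contamination=0.5),
    BinQuality("bin_B", completeness=85.0, contamination=1.0),  # too incomplete
    BinQuality("bin_C", completeness=90.0, contamination=5.0),  # on both limits
    BinQuality("bin_D", completeness=95.0, contamination=7.5),  # too contaminated
]

if __name__ == "__main__":
    for kept in filter_high_quality(example_bins):
        print(kept.bin_id, kept.completeness, kept.contamination)
```

Both comparisons are inclusive, so a bin sitting exactly on a threshold is retained, matching the "≥" and "≤" wording of the filter.
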
The resulting .csv files were then processed in RStudio (RStudio Team, 2024) using the packages dplyr (v1.1.4; Wickham et al., 2023), readr (v2.1.5; Wickham et al., 2024), and stringr (v1.5.1; Wickham, 2023) to generate relative frequencies of COG categories for each MAG and across all MAGs. Lastly, genome mining of biosynthetic gene clusters (BGCs) was performed using antiSMASH v7.0 (Blin et al., 2023) with relaxed detection strictness. The predicted BGCs from the seven high-quality MAGs were recorded and tallied.

III. Data and Analysis

From the Q30-trimmed metagenomic reads, seven bins were filtered and optimized. Figure 1A provides a summary of their genome characteristics. These seven MAGs exhibited CheckM completeness of over 90% and contamination of under 5%, aligning with MIMAG standards (Bowers et al., 2017). Notably, bins 002, 023, and 027 surpassed 98% CheckM completeness, while bins 010, 023, 024, and 027 had less than 1% contamination. Additionally, four of the seven bins scored above 90% completeness with BUSCO. Further quality assessment with QUAST v5.2.0 (Mikheenko et al., 2018) revealed genome sizes ranging from 1.37 to 2.17 Mbp across the bins. Bins 002, 010, 023, and 027 contained fewer than 75 contigs, with bin 027 having the fewest at 55 contigs. All bins had an L50 value below 50, with bins 002 and 010 presenting the lowest at 13. The N50 values for all bins were above 14,000 bp, with bin 010 having the highest at 48,078 bp. The GC content of the seven bins ranged from 39.6% to 62.88%. The complete CheckM, BUSCO, and QUAST results for the seven MAGs are detailed in Supplementary File 1. The seven bins were taxonomically classified using the Genome Taxonomy Database Toolkit (GTDB-Tk) v2.3.2 (Chaumeil et al., 2019), and their taxonomic identities are shown in Figure 1B. Four of the seven MAGs were classified under Archaea, and three under Bacteria.
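
For readers less familiar with the contiguity statistics reported above, the following Python sketch shows how N50 and L50 are commonly computed from a list of contig lengths. The function name `n50_l50` and the example lengths are hypothetical; the sketch uses the usual definition (N50 is the length of the shortest contig in the smallest set of longest contigs covering at least half of the total length, and L50 is the number of contigs in that set) and is not taken from QUAST's source code.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    # Walk through contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for count, length in enumerate(lengths, start=1):
        running_total += length
        # The first contig that brings the running total to at least half of
        # the assembly defines N50; the number of contigs used so far is L50.
        if running_total >= half_total:
            return length, count


if __name__ == "__main__":
    # Hypothetical contig lengths (bp), for illustration only.
    n50, l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
    print(f"N50 = {n50} bp, L50 = {l50} contigs")
```

A higher N50 together with a lower L50 indicates that more of a bin is contained in fewer, longer contigs, which is why the two values are reported side by side.
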
The MAGs under Archaea were divided into two phyla: Thermoproteota (Bins 002, 023, 025) and Halobacteriota (Bin 027), while the MAGs under Bacteria were divided into Pseudomonadota (Bins 024, 029) and Bacteroidota (Bin 010). Detailed taxonomic identities of all MAGs up to genus level are presented in Supplementary File 2. Figure 2A presents the top 10 closest genomes to the bins classified under the domain Archaea. Figure 2B illustrates the phylogenetic relationships for bins identified under Bacteria and their respective top 10 closest genomes. For archaeal MAGs (Figure 2A), QMWW01 (Bin 002) was found to cluster near the Staphylothermus clade, suggesting it may belong to this genus. This is supported by GTDB results (Figure 1B), which identified QMWW01 within the family Desulfurococcaceae, the same family as Staphylothermus. QNYQ01 (Bin 023) and WAQM01 (Bin 025) formed a monophyletic clade near Thermosphaera aggregans, also within Desulfurococcaceae (Anderson et al., 2009). QMWW01, QNYQ01, and WAQM01 clustered with other GenBank genomes, including Staphylothermus and Thermosphaera, all classified under the order Sulfolobales. The Sulfolobales genomes displayed genes linked to nitrogen, phosphorus, and sulfur metabolism. Sulfolobales is a group of thermoacidophilic Archaea, the majority of which are facultatively or obligately chemolithoautotrophic. The ability to metabolize sulfur, whether in its elemental form or as reduced inorganic sulfur compounds, enables Sulfolobales to grow autotrophically and is therefore considered their most important physiological characteristic (Liu et al., 2021). Additionally, only the Sulfolobales MAGs were observed to possess genes associated with potassium metabolism. Notably, QMWW01 was the only genome within this clade to exhibit genes involved in iron acquisition and metabolism.
On the opposite branch of the tree, WYZ-LMO2 (Bin 027) formed a monophyletic clade with Archaeoglobus fulgidus, along with other members of the genera Ferroglobus, Geoglobus, and Archaeoglobus, all belonging to the family Archaeoglobaceae (Brileya & Reysenbach, 2014). All Archaeoglobaceae genomes exhibited genes associated with nitrogen and sulfur metabolism. This thermophilic and obligately anaerobic family, found in marine and terrestrial environments, is known for its diverse metabolic capabilities, including nitrate and sulfate reduction (Brileya & Reysenbach, 2014). Notably, WYZ-LMO2 was the only genome within the clade to possess genes associated with phosphorus metabolism. Lastly, no genome within the clade exhibited genes for iron acquisition and metabolism. For bacterial MAGs (Figure 2B), Glaciecola sp. (Bin 024) and UBA8309 (Bin 029) were identified as closely related to the GenBank genome Glaciecola pallidula DSM 14239. All three genomes were classified under the phylum Pseudomonadota (syn. Proteobacteria). Both Glaciecola pallidula DSM 14239 and Glaciecola sp. contained genes from all the metabolism categories analyzed in this report. Members of the genus Glaciecola are known to inhabit various marine environments (Bian et al., 2011; Qin et al., 2014), including seawater similar to the sample source in this study. This genus is also recognized for its ability to break down biopolymers such as cellulose, chitin, and xylan (Qin et al., 2014), suggesting a significant role in organic matter degradation and carbon cycling in marine environments. UBA8309 was classified under the family Candidatus Puniceispirillaceae. Puniceispirillum marinum IMCC1322, which belongs to the same family, is known for its metabolic generalism in oceanic nutrient cycling and possesses genes associated with dimethylsulfoniopropionate (DMSP), which may play a role in marine sulfur cycling (Oh et al., 2010).
This suggests that UBA8309 may also play a role in nutrient cycling in SGD areas. UBA10364 (Bin 010) was identified as the only MAG classified under Bacteroidota and was the most distantly related bin compared to the GenBank bacterial genomes. The MAG falls under the order Flavobacteriales. Members of this group are known to metabolize sulfur-containing substances and participate in the sulfur cycle (Wang et al., 2023). They also produce extracellular hydrolases to degrade macromolecular organic substances (Wang et al., 2023). The presence of sulfur metabolism genes in UBA10364 suggests a role in sulfur cycling, and the detection of genes related to phosphorus, potassium, and iron metabolism indicates potential involvement in cycling these elements as well. The nutrient metabolism genes identified in the bins were further analyzed (Supplementary File 4). For nitrogen metabolism, the bins were found to contain genes associated with nitrate and nitrite ammonification, as well as ammonia assimilation. In the case of phosphorus metabolism, genes involved in phosphate metabolism, high-affinity phosphate transport, phosphate regulon, and polyphosphate synthesis were detected. For potassium metabolism, the identified genes included those related to the glutathione-regulated potassium-efflux system, hyperosmotic potassium uptake, and potassium homeostasis. For sulfur metabolism, genes linked to galactosylceramide and sulfatide metabolism, sulfite reduction, and thioredoxin-disulfide reductase were observed. Lastly, within the iron acquisition and metabolism category, genes encoding encapsulating proteins for peroxidase enzymes, iron acquisition mechanisms, and the hemin transport system were identified. In addition to the RAST results, the DRAM analysis (Supplementary File 5) of the bins revealed a diverse array of nutrient metabolism genes. 
These included genes involved in carbon, nitrogen, and sulfur metabolism, as well as genes associated with arsenate and mercury reduction, methanogenesis, and alcohol production. The diverse metabolic capabilities of the recovered bacterial and archaeal MAGs, supported by annotated genes and their taxonomic identities, suggest that these putative microorganisms may play a crucial role in biogeochemical cycling in SGD areas of Mabini, Batangas. This conclusion is further supported by various studies highlighting the metabolic functions of similar taxa in nutrient cycling. Functional COGs were identified across all MAGs using eggNOG. The analysis revealed that the category with the highest relative frequency of COGs was "Unknown function" (S) (Supplementary File 6 Figure 1). This observation aligns with individual MAG results, where the most frequently assigned COG in all bins also belonged to the "Unknown function" category (Supplementary File 6 Figures 2-8). This finding suggests that many COGs within the bins require additional data for proper classification. In addition to the "Unknown function" category, several functional COGs were found to have relative frequencies exceeding 5%, including J (translation, ribosomal structure, and biogenesis), C (energy production and conversion), E (amino acid transport and metabolism), H (coenzyme transport and metabolism), and L (replication, recombination, and repair). The abundance of these functional genes suggests that the microorganisms in the vent water heavily rely on these processes for survival, particularly in the SGD environment. In addition to the analysis of the nutrient metabolism genes and COGs, the presence of BGCs within the MAGs was assessed, as detailed in Supplementary File 3. Using antiSMASH, it was found that only QNYQ01 (Bin 023) and WYZ-LMO2 (Bin 027) lacked detectable BGCs. 
In contrast, the remaining MAGs exhibited BGCs associated with ectoine, ribosomally synthesized and post-translationally modified peptide (RiPP)-like compounds, terpenes, and redox cofactors. Among these, Glaciecola sp. (Bin 024) contained the highest number of BGCs, including those related to ectoine, RiPP-like compounds, and terpenes. Notably, RiPP-like and terpene-related BGCs were the most prevalent across all MAGs. RiPP-like clusters, previously classified as bacteriocin clusters (Blin et al., 2021), encode compounds that have been studied for their anticancer and antibiotic applications (Negash & Tsehai, 2020; Thapar & Kumar Salooja, 2023). Similarly, terpenes are renowned for their diverse biological activities, including antiplasmodial, antiviral, anticancer, and antidiabetic properties (Cox-Georgian et al., 2019). The identification of these BGCs in putative microbes from vent water at SGD sites in Mabini, Batangas, suggests potential biotechnological and pharmaceutical applications. Although these microorganisms remain uncultured, future advances in cultivation and isolation techniques may make it possible to unlock this potential, leading to the discovery of novel therapeutic compounds and other valuable bioactive agents.

    Keywords: Biosynthetic gene clusters, Metagenome-assembled genome, Nutrient metabolism genes, Shotgun sequencing, Submarine groundwater discharge, Vent water

    Received: 04 Nov 2024; Accepted: 13 Jan 2025.

    Copyright: © 2025 Veluz, Mallari, Gloria and Siringan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Joshua Talavera Veluz, Natural Sciences Research Institute, University of the Philippines Diliman, Quezon City, Philippines

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.