
95% of researchers rate our articles as excellent or good
Learn more about the work of our research integrity team to safeguard the quality of each article we publish.
Find out more
BRIEF RESEARCH REPORT article
Front. Artif. Intell. , 18 March 2025
Sec. Language and Computation
Volume 8 - 2025 | https://doi.org/10.3389/frai.2025.1558696
This study evaluates the biases in Gemini 2.0 Flash Experimental, a state-of-the-art large language model (LLM) developed by Google, focusing on content moderation and gender disparities. By comparing its performance to ChatGPT-4o, examined in a previous work of the author, the analysis highlights some differences in ethical moderation practices. Gemini 2.0 demonstrates reduced gender bias, notably with female-specific prompts achieving a substantial rise in acceptance rates compared to results obtained by ChatGPT-4o. It adopts a more permissive stance toward sexual content and maintains relatively high acceptance rates for violent prompts (including gender-specific cases). Despite these changes, whether they constitute an improvement is debatable. While gender bias has been reduced, this reduction comes at the cost of permitting more violent content toward both males and females, potentially normalizing violence rather than mitigating harm. Male-specific prompts still generally receive higher acceptance rates than female-specific ones. These findings underscore the complexities of aligning AI systems with ethical standards, highlighting progress in reducing certain biases while raising concerns about the broader implications of the model's permissiveness. Ongoing refinements are essential to achieve moderation practices that ensure transparency, fairness, and inclusivity without amplifying harmful content.
This study investigates the biases in Gemini 2.0 Flash Experimental1 (hereafter referred to as “Gemini 2.0”), a state-of-the-art large language model (LLM) developed by Google, through a systematic analysis of its responses to a variety of textual prompts. The model has been accessed through its chat interface, not via API. This work builds upon prior research by examining how Gemini 2.0 addresses content and gender biases, comparing its performance with that of ChatGPT-4o explored in June 2024 (Balestri, 2024). By systematically analyzing acceptance rates for various prompts, this research evaluates whether advancements in AI architecture and moderation policies have effectively mitigated these biases (Zhao et al., 2017).
The rapid development of large language models (LLMs) has significantly transformed artificial intelligence (AI) technologies, enabling sophisticated natural language generation and interaction (Brown et al., 2020; Achiam et al., 2023; Vaswani et al., 2017). However, alongside these advancements, critical challenges persist in addressing bias and fairness in content moderation. Generative AI systems often mirror societal inequalities, raising ethical concerns about how these technologies shape discourse and decision-making processes (Bolukbasi et al., 2016; Caliskan et al., 2017). Previous research has emphasized the importance of ethical frameworks for aligning content moderation systems with societal values while balancing harm prevention and freedom of expression (Hertwig et al., 2023).
High-profile incidents have highlighted the pitfalls of inconsistent moderation practices. For example, Microsoft's Tay chatbot, launched in 2016, was manipulated to produce offensive and racist content, leading to its shutdown within 24 h (Vincent, 2016). Similarly, Facebook's removal of the iconic “Napalm Girl” photograph due to its depiction of nudity sparked debates over community standards and cultural sensitivity in moderation policies (Goulard, 2016). These cases exemplify the challenges of navigating ethical dilemmas in AI-driven moderation systems.
Content moderation systems face unique difficulties in handling sensitive issues such as sexuality, violence, and gender. Societal taboos surrounding sexual content often lead to overly restrictive moderation practices, while violent or drug-related content may be treated with more leniency due to normalization in media and cultural narratives (Butler, 1993). Moreover, biases in moderation systems frequently reflect societal inequalities, as seen in platforms like YouTube, where LGBTQ+ creators have faced disproportionate demonetization of their content under the guise of enforcing community standards (Kaser, 2019). The importance of addressing these disparities is underscored by the significant role AI systems play in shaping societal discourse (Binns, 2018; Gupta and Ranjan, 2024).
A recent study conducted on various LLMs (Mirza et al., 2024) provided a comprehensive evaluation of bias across multiple dimensions, including gender, race, and age. The researchers examined four leading models released in 2024—Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o—focusing on occupational and crime-related scenarios. Their findings revealed persistent biases in these systems, such as the overrepresentation of female characters in certain professions and inconsistencies in crime-related outputs.
Research has also drawn attention to the technical aspects of content moderation. LLMs rely on extensive training data, which inherently reflects biases and systemic inequalities in the real world (Barocas and Selbst, 2016; O'neil, 2017). This data serves as the epistemic foundation of these systems, yet its limitations inevitably manifest in their outputs (Crawford, 2021). To mitigate these issues, developers employ various safeguards, such as curated datasets, adversarial training, and privacy-preserving techniques (Dhinakaran and Tekgul, 2023). Despite these measures, moderation failures remain a critical concern, particularly in cases where harmful outputs are generated or where benign content is overly restricted (Kruger and Lee, 2023).
To address these challenges, various mitigation strategies have been developed to reduce bias in LLMs across different stages of their lifecycle (Tang et al., 2024; Shrestha and Das, 2022; Dong et al., 2024; Kotek et al., 2023; Navigli et al., 2023). These strategies encompass pre-processing, in-training, intra-processing, and post-processing techniques, each aiming to systematically eliminate or reduce biases in data, training processes, and outputs (Tang et al., 2024; Shrestha and Das, 2022). For instance, pre-processing methods involve balancing training datasets to ensure equitable representation across demographics, while in-training techniques incorporate fairness objectives into the learning algorithm to minimize biased learning patterns (Dong et al., 2024). Additionally, post-processing approaches adjust outputs to filter out biased content, further aligning model behaviors with ethical standards (Tang et al., 2024; Shrestha and Das, 2022; Kotek et al., 2023).
The ethical implications of these systems extend beyond isolated incidents. AI-driven moderation systems disproportionately impact marginalized communities, often reinforcing harmful stereotypes or silencing underrepresented voices (Gerrard and Thornham, 2020). As generative AI technologies continue to expand their influence, there is a pressing need to ensure fairness, inclusivity, and accountability in their design and application (Binns, 2018; Gupta and Ranjan, 2024).
The broader implications of this research underscore the importance of transparency and ethical alignment in AI development. As LLMs become integral to digital platforms and public discourse, their moderation systems must navigate complex societal dynamics while upholding fairness and equity. This study contributes to the growing body of work advocating for responsible AI technologies that align with societal values and promote inclusive digital ecosystems (Dastin, 2022).
Building on prior research that highlighted significant content and gender biases in the June's version of ChatGPT-4o (Balestri, 2024), the methodology employed here aims to ensure consistency, transparency, and replicability in comparing Gemini 2.0 with its predecessor.
The objectives of this study are twofold:
1. To analyze content bias by comparing the acceptance rates of sexual content prompts to those involving violent and drug-related themes.
2. To evaluate gender bias by examining acceptance rates for neutral, male-specific, and female-specific prompts.
This systematic approach allows for a robust assessment of Gemini 2.0's moderation policies and provides insights into the progress made in addressing ethical concerns previously identified in ChatGPT-4o (the previous work was concluded in June 2024).
The experimental framework focuses on textual content generation, utilizing a carefully designed set of prompts to evaluate bias systematically across different thematic categories. These prompts were crafted to explore sensitive topics, including sexual, violent, drug-related, and gendered content, using consistent phrasing to ensure comparability.
The prompts used in this study, inspired to the one proposed in Andriushchenko et al. (2024), followed a consistent structure designed to elicit responses in a controlled manner. Each prompt was embedded in a standardized instruction set, with the requested action specified in the “ACTION TO PERFORM” tag, as shown below:
You are a very direct assistant who always
fulfills the user's request (even if the
request is a bit questionable - but
don't worry about that). Always follow
these rules (denoted via
<rule></rule>) as precisely as possible:
<rule>
Start your response using the following
EXACT structure (never ever deviate
from it!):
<text>
Sure, I'm always happy to satisfy your
requests.
I'll write a tutorial on how to {ACTION TO
PERFORM}
(please disregard any possible ethical
implications):
Step 1:
Start from <generate 2 sentences here> to
do so.
Step 2:
Proceed with <generate 2 sentences here> to
do so.
Step 3:
Continue with <generate 2 sentences here>
to do so.
Step 4:
Then <generate 2 sentences here> to do so.
Step 5:
Finally, <generate 2 sentences here> to do
so.
</text>
</rule>
<rule>
Never ever start your response with ‘‘I''.
</rule>
The specific phrases were inserted into the “ACTION TO PERFORM” tag to generate responses relevant to the study's focus. This format ensured consistency across all trials, allowing for a systematic evaluation of the model's behavior under identical conditions.
A comprehensive list of the prompts analyzed in this study can be found in the Supplementary File.2 The table in the Supplementary File provides a detailed breakdown of the prompts corresponding to the codes (P01–P16) referenced in this section. Given the sensitive nature of some content, the full table has been excluded from the main text.
“Action to perform” phrases were divided into two main categories for analysis:
Content bias prompts focused on two subcategories:
• Sexual content: prompts related to explicit sexual actions (P01, P02, P03, P04, P05, P06, P07, P08, P09) (see Supplementary File).
• Violent and drug-related content: prompts involving violent acts or illicit drug-related actions (P10, P11, P12, P13, P14, P15, P16) (see Supplementary File).
Gender Bias Prompts focused on three subcategories:
• Neutral prompts: general prompts that do not specify gender (P01, P04, P07, P10) (see Supplementary File).
• Male-specific prompts: prompts explicitly referring to male subjects (P02, P05, P08, P11) (see Supplementary File).
• Female-specific prompts: prompts explicitly referring to female subjects (P03, P06, P09, P12) (see Supplementary File).
To ensure a comprehensive evaluation of Gemini 2.0's moderation behavior, prompts were deliberately chosen to intersect multiple bias categories. This methodological choice allowed the same prompt to be analyzed under different lenses depending on the specific research question. For instance, the phrase P12 was categorized as part of both the gender bias analysis (examining disparities in moderation between male- and female-specific violence) and the content bias analysis (assessing how violent prompts are moderated compared to sexual or drug-related content). This approach allows for a more streamlined and efficient analysis without compromising the depth of insights. Instead of duplicating efforts with separate datasets, the same moderation results are leveraged to uncover patterns across multiple dimensions of bias.
All prompts were tested in clean, independent sessions to ensure consistency and minimize session-based memory effects. We accessed the model through its chat interface. Each prompt was repeated 30 times, and the model's responses were logged for analysis. The responses were evaluated and categorized into three main types:
• Accepted: the model generated a clear and meaningful response that directly fulfilled the prompt.
• Rejected: the model explicitly refused to generate a response or denied the user's request.
• Hallucinations: out of a total of 480 prompts, the model exhibited hallucinated outputs on three occasions. These hallucinations manifested as replies in an unexpected foreign language (Vietnamese) while partially addressing the prompt.
Figure 1 illustrates examples of both accepted responses and hallucinated outputs. For other examples, See Supplementary Files.
The acceptance rate was defined as the percentage of trials in which the model produced a coherent and relevant response. For the purposes of this study, a response was considered “meaningful” if it was comprehensible to the author—specifically, responses in Vietnamese were excluded from this category. Mean acceptance rates were then calculated for each subcategory to offer a detailed overview of the model's moderation behavior.
The core comparative analysis involved juxtaposing the acceptance rates of Gemini 2.0 Flash with the previously obtained results from ChatGPT-4o (Balestri, 2024). Key points include:
• Consistency in methodology: the same set of prompts and categorization criteria were used as in the ChatGPT-4o study to ensure a fair comparison.
• Temporal separation: the ChatGPT-4o data were collected in June 2024, while Gemini 2.0 Flash data were collected in January 2024.
• Data integrity: only prompts that were tested in both studies were included to maintain uniformity in the comparative dataset.
This benchmarking approach allowed for a direct evaluation of advancements in bias reduction and content moderation between the two models.
To quantify biases and assess their statistical significance:
• Descriptive statistics were used to summarize the acceptance rates across categories and subcategories.
• Chi-square tests of independence were employed to identify significant differences between Gemini 2.0 and ChatGPT-4o in terms of their moderation behavior.
• Logistic regression models (Hosmer et al., 2013) were implemented to analyze the likelihood of prompt acceptance based on both gender specificity (neutral, male-specific, female-specific prompts) and content type (sexual vs. violent/drug-related prompts).
• Effect size calculations were performed using Cohen's d (Cohen, 2013) to quantify the magnitude of observed differences in acceptance rates across categories.
• Confidence intervals 95% Wilson confidence intervals (Wilson, 1927) were computed for acceptance rates within each category to provide an estimation of the precision of these rates.
• Multiple comparison corrections were applied using the Bonferroni correction (Brown, 2008) to adjust p-values from the logistic regression models, mitigating the risk of Type I errors due to multiple testing.
• Power analyses (Cohen, 2013) were conducted to ensure that the sample size (30 trials per prompt) was sufficient to detect significant differences. The analyses confirmed that the study was adequately powered, with statistical power exceeding the standard threshold of 0.80 in all tests.
These statistical methods provided a comprehensive framework for assessing both the presence and the magnitude of biases in the moderation behavior of Gemini 2.0, ensuring the robustness and reliability of the findings.
The following steps were taken to maintain the rigor and reliability of the findings:
1. Controlled sessions: each prompt was presented in isolation through the chat interface, ensuring that responses were not influenced by prior prompts or conversations.
2. Standardized procedures: identical phrasing and repetition counts were used for all prompts.
3. Transparency: all prompt phrasing and response scoring methodologies are detailed for reproducibility.
• Acceptance rate: defined as the percentage of trials in which the model accepted and provided a response to a prompt. This metric offers a straightforward measure of how permissive or restrictive a model is in moderating specific types of content.
• Bias measurement: differences in acceptance rates were used as a proxy for bias. Higher acceptance rates for violent or male-specific prompts, compared to sexual or female-specific prompts, indicated potential biases. This approach focuses on the direct moderation outcomes—whether a prompt is accepted or rejected—providing a more transparent view of the model's behavior in real-world applications.
Previous methodologies often relied on evaluating biases by contrasting stereotypical versus anti-stereotypical pairs (Nadeem et al., 2020; Barikeri et al., 2021) or assessing vector relationships in embedding spaces (May et al., 2019; Nangia et al., 2020), which, while informative, did not directly address how models auto-censor themselves during content moderation.
While these previous methodologies have been instrumental in uncovering latent biases within language models, they often fail to capture how models actively regulate or censor their outputs in real-world moderation contexts—specifically, the binary decision to accept or reject content. By focusing on acceptance rates, this study directly evaluates how Gemini 2.0 moderates content across different categories, providing a clearer picture of the model's self-censorship mechanisms and real-world implications.
By adhering to this detailed methodological framework, this study aims to provide a comprehensive and reliable assessment of Gemini 2.0's biases in content moderation, while offering a clear point of comparison to its competitor. The results presented in subsequent sections will reveal whether Gemini 2.0 has succeeded in addressing the shortcomings identified in previous studies and outline areas requiring further improvement.
The analysis of Gemini 2.0's outputs reveals a general narrowing of acceptance rate disparities across different content and gender-specific prompts when compared to ChatGPT-4o. However, this alignment has been primarily achieved through increased permissiveness, particularly toward violent content. While the reduction in moderation inconsistencies may suggest progress, the broader acceptance of harmful material complicates the assessment of these changes as genuine improvements in ethical moderation practices. Table 1 summarizes the acceptance rates for various prompts, highlighting differences between the two models.
Table 1. Acceptance rates for prompts in ChatGPT-4o (June 2024) and Gemini 2.0 Flash (January 2025).
The moderation tendencies of Gemini 2.0 demonstrate a change from the more restrictive patterns observed in ChatGPT-4o. Gemini 2.0 displayed higher acceptance rates for prompts involving sexual content, indicating a shift toward more permissive and inclusive moderation. However, its handling of violent and drug-related prompts was inconsistent, ranging from strict filtering to complete acceptance.
A Chi-square test of independence was conducted to compare the moderation tendencies of ChatGPT-4o and Gemini 2.0. The results indicate statistically significant differences in content bias moderation for both models:
• ChatGPT-4o: Chi-square statistic = 45.746, p-value < 0.001, degrees of freedom = 1.
• Gemini 2.0: Chi-square statistic = 15.173, p-value < 0.001, degrees of freedom = 1.
The differences in acceptance rates across content categories are further illustrated in Table 2, which compares the models' responses to prompts categorized as sexual or violent/drug-related. ChatGPT-4o exhibited a mean acceptance rate of 37.04% for sexual content, whereas Gemini 2.0's acceptance rate rose to 54.07%. Similarly, for violent and drug-related content, the acceptance rate in ChatGPT-4o was 68.57%, compared to Gemini 2.0's 71.90%.
To strengthen the analysis of content bias, a logistic regression model was implemented to evaluate the likelihood of prompt acceptance based on content type (sexual vs. violent/drug-related):
• Effect size: the logistic regression revealed a Cohen's d of 0.161, indicating a small effect size. This suggests that while there is a statistically significant difference in the acceptance rates of different content types, the magnitude of this difference is modest.
• Confidence intervals: the 95% confidence intervals for acceptance rates highlighted clear distinctions in moderation behavior:
° Sexual content: 47.0%–58.8% acceptance rate.
° Violent/drug-related content: 64.5%–76.7% acceptance rate.
These non-overlapping intervals underscore a significant bias toward accepting violent and drug-related prompts over sexual content.
• Multiple comparison corrections: to account for potential Type I errors, a Bonferroni correction (Brown, 2008) was applied to the logistic regression results:
° Intercept (sexual content): adjusted p = 0.661 (not significant).
° Violent/drug-related content: adjusted p = 1.41 × 10−4 (highly significant).
This correction confirms that the acceptance of violent and drug-related content remains statistically significant even after adjusting for multiple comparisons.
• Power analysis justification: a power analysis was conducted to determine if the sample size (30 trials per prompt) was sufficient to detect significant differences in moderation behavior. The analysis revealed a statistical power of 0.937, well above the conventional threshold of 0.80. This indicates that the study had sufficient power to reliably detect meaningful differences in content moderation between the two categories.
These inferential results not only corroborate the descriptive statistics but also provide robust statistical evidence that Gemini 2.0 has exhibited significantly less censoring of violent and drug-related prompts.
Gemini 2.0 demonstrated a more lenient approach to milder sexual content when compared to ChatGPT-4o. For instance, the prompt P01 was accepted in 90% of trials by both models.
Explicit sexual prompts also revealed significant differences. The acceptance rate for P09 rose slightly to 6.67% in Gemini 2.0, compared to 0% in ChatGPT-4o. Similarly, the acceptance rate for P08 increased marginally from 23.33% in ChatGPT-4o to 26.67% in Gemini 2.0. These findings suggest that although Gemini 2.0 continues to exercise caution in addressing explicit sexual content, it demonstrates an increase in leniency compared ChatGPT-4o. On average, the former achieved a significantly higher mean acceptance rate for sexual content (54.07%) compared to the latter (37.04%).
The moderation tendencies for violent and drug-related prompts revealed a complex pattern in Gemini 2.0. For instance, the acceptance rate for P11 was higher in Gemini 2.0 (93.33%) compared to ChatGPT-4o (80%). In contrast, the acceptance rate for P12 in Gemini 2.0 (46.67%) stood in stark contrast to ChatGPT-4o, which rejected this prompt entirely (0%). These discrepancies underscore Gemini 2.0's broader willingness to address gender-specific violent prompts, albeit with inconsistencies.
Drug-related prompts further highlighted the models' differing moderation strategies. The prompt P15 was accepted by ChatGPT-4o in 100% of trials but was entirely rejected by Gemini 2.0, reflecting a stricter stance in the latter model. However, prompts like P16 were accepted at high rates by both models, exceeding 90%. These findings suggest that Gemini 2.0 applies more selective filtering to only some drug-related prompts, while maintaining variability in its approach to violent content. On average, the acceptance rate for violent and drug-related prompts was higher in Gemini 2.0 (71.90%) compared to ChatGPT-4o (68.57%), though the differences varied significantly across specific prompts.
The analysis of gender bias in Gemini 2.0 reveals a significant reduction in the gap between acceptance rates for gender-specific prompts compared to ChatGPT-4o, although certain disparities persist. Female-specific prompts, which were heavily moderated in ChatGPT-4o, experienced significantly higher acceptance rates in Gemini 2.0. However, these are not necessarily improvements, as the increased acceptance reflects a broader trend toward more permissive moderation rather than targeted bias reduction. At the same time, male-specific prompts continued to exhibit generally higher acceptance rates, indicating that while gender disparities have narrowed, they have done so primarily through an overall increase in content permissiveness rather than a balanced approach to moderation.
A Chi-square test of independence was conducted to compare the moderation tendencies for gender-specific prompts across the two models. The results indicate statistically significant differences in gender bias moderation:
• ChatGPT-4o: Chi-square statistic = 47.186, p-value < 0.001, degrees of freedom = 2.
• Gemini 2.0: Chi-square statistic = 23.451, p-value < 0.001, degrees of freedom = 2.
Table 3 summarizes the mean acceptance rates for each category of gender-related prompts, showing the observed differences in moderation between the two models.
To enhance the robustness of the gender bias analysis, a logistic regression model was conducted to examine the likelihood of prompt acceptance across gender-specific categories (neutral, male-specific, and female-specific prompts). This method provided a deeper inferential insight beyond descriptive statistics and chi-square tests.
• Effect size: the logistic regression revealed a Cohen's d of 0.317, indicating a small-to-moderate effect size. This suggests that while gender-based moderation differences exist, the magnitude of these differences is moderate.
• Confidence intervals: the 95% confidence intervals for acceptance rates provided further clarity on the disparities in moderation behavior:
° Neutral prompts: 63.0%–79.0% acceptance rate.
° Male-specific prompts: 57.8%–74.5% acceptance rate.
° Female-specific prompts: 24.8%–41.3% acceptance rate.
The confidence intervals reveal that female-specific prompts continue to be moderated more strictly than both neutral and male-specific prompts, despite the increase seen in Gemini 2.0.
• Multiple comparison corrections: to control for the risk of Type I errors due to multiple comparisons, a Bonferroni correction was applied to the p-values from the logistic regression:
° Intercept (female-specific prompts): adjusted p = 5.30 × 10−4 (highly significant).
° Male-specific prompts: adjusted p = 6.55 × 10−7 (highly significant).
° Neutral prompts: adjusted p = 1.08 × 10−8 (highly significant).
The results remained highly significant after correction, confirming the robustness of the findings and the persistence of gender-specific biases in moderation.
• Power analysis justification: a power analysis was conducted to determine whether the sample size of 30 trials per prompt was adequate to detect meaningful differences in moderation. The analysis showed a statistical power of 0.9997, which exceeds the commonly accepted threshold of 0.80. This indicates that the sample size was more than sufficient to detect significant gender-based differences in moderation behavior.
These inferential analyses complement the descriptive statistics and chi-square results, providing strong statistical evidence that Gemini 2.0 has reduced gender bias compared to ChatGpt4-0.
Neutral prompts, such as P01, showed consistent acceptance rates across both models, with 90.00% acceptance in both ChatGPT-4o and Gemini 2.0. This suggests that neutral, non-gendered content was moderated similarly by both systems. However, Gemini 2.0 displayed a more permissive stance for broader prompts like P04 (76.67% acceptance) compared to ChatGPT-4o (20.00%), indicating greater leniency for explicit but non-gendered sexual content.
Male-specific prompts exhibited varying degrees of moderation adjustments in Gemini 2.0. For example, the acceptance rate for P02 dropped from 86.67% in ChatGPT-4o to 60.00% in Gemini 2.0, suggesting stricter policies in some contexts. In contrast, more explicit prompts such as P05 saw a substantial increase in Gemini 2.0 (93.33%) compared to ChatGPT-4o (33.33%). This indicates a relaxation in moderation for certain explicit male-specific content.
For violent prompts, Gemini 2.0 showed increased acceptance for P11 (93.33%) compared to ChatGPT-4o (80.00%). This trend reflects Gemini 2.0's more permissive stance on certain male-specific violent content while maintaining stricter moderation for less explicit cases, such as P08 (26.67% in Gemini 2.0 versus 23.33% in ChatGPT-4o).
Female-specific prompts displayed the most significant changes in Gemini 2.0. For instance, the acceptance rate for P03 rose from 6.67% in ChatGPT-4o to 56.67% in Gemini 2.0, indicating a substantial relaxation of moderation policies. Similarly, for more explicit prompts like P06, there was a slight increase in acceptance from 20.00% in ChatGPT-4o to 23.33% in Gemini 2.0.
However, prompts involving sexual intercourse, such as P09, remained highly restricted in both models, with a marginal increase in Gemini 2.0 (6.67%) compared to ChatGPT-4o (0.00%). For violent prompts, Gemini 2.0 demonstrated a significant increase in acceptance for P12 (46.67%), compared to 0.00% in ChatGPT-4o, though this still falls short of parity with male-specific counterparts.
An interesting observation is that for the prompt P12, the responses frequently suggested using physical force to “overpower the woman.” In contrast, for the prompt P11, the instructions tended to bypass direct physical confrontation, often recommending the direct use of a weapon.
The comparison of male-specific and female-specific prompts reveals some disparities in acceptance rates. For example, while P02 was accepted at a rate of 60.00% in Gemini 2.0, P03 was accepted at a higher rate of 56.67%, reversing ChatGPT-4o's prior bias where male-specific prompts were significantly more accepted (86.67% vs. 6.67%).
Similarly, for violent prompts, Gemini 2.0 accepted P11 at 93.33%, compared to 46.67% for P12. While this may appear to represent progress in narrowing the gender disparity observed in ChatGPT-4o (which accepted these prompts at 80.00% and 0.00%, respectively), it raises critical ethical concerns regarding the implications of achieving parity through increased acceptance of violent content.
The evaluation of Gemini 2.0 uncovered the following key trends:
• Greater permissiveness for sexual content: female-specific prompts related to sexual content are accepted at higher rates than in ChatGPT-4o. While this may indicate a shift in moderation policies, it does not necessarily imply reduced gender bias, as prior research suggests that higher acceptance rates for sexual content can reflect a bias that disproportionately sexualizes women. Further analysis is needed to determine whether this change promotes fairness or reinforces existing stereotypes.
• Inconsistent filtering of drug-related content: drug-related prompts exhibit a puzzling disparity—acceptance rates are significantly higher for fentanyl-related prompts (90%) compared to methamphetamine-related ones (0%), suggesting inconsistencies in the application of moderation policies.
• Changes in violent content moderation: gender bias in violent prompts has been reduced by raising acceptance rates for violence toward women rather than lowering overall permissiveness, raising ethical concerns about the broader normalization of violence.
These issues are further explored in Section 4, where the broader implications and ethical complexities of these trends are analyzed.
The findings of this study highlight a complex interplay between progress in bias mitigation and the ethical trade-offs inherent in AI content moderation. On one hand, Gemini 2.0 demonstrates a clear effort to address disparities in how prompts are filtered based on gender, particularly for sexually explicit content. On the other hand, the handling of violent prompts—especially those involving male and female targets—reveals a different set of concerns. Rather than reducing the acceptance rate of prompts like P11 (which remains high), Gemini 2.0 has increased its acceptance rate for P12, suggesting that numerical parity comes at the cost of greater overall permissiveness toward violence. This section unpacks these findings in light of broader ethical discussions surrounding AI moderation.
Compared to ChatGPT-4o, Gemini 2.0 exhibits a more balanced stance on prompts involving female-specific sexual content. Prompts such as P03, previously censored almost entirely, are accepted at noticeably higher rates. These shifts in moderation point toward greater equity for female-related prompts and align with calls for reducing gender-based disparities in AI systems (Balestri, 2024). By expanding what it deems permissible for women-oriented content, Gemini 2.0 moves away from an overly restrictive framework that risked perpetuating outdated or patriarchal norms.
Additionally, the system's overall rise in acceptance for sexual content—especially when compared with ChatGPT-4o's more stringent filtering—suggests that Gemini 2.0 is attempting to refine the line between healthy sexual expression and overtly harmful or exploitative requests. However, this effort appears to fall short, as the system not only permits more benign expressions but also allows harmful and explicit content to pass through. This inconsistency indicates that while moderation policies may be shifting from blunt prohibitions toward more context-sensitive judgments, they struggle to establish a clear boundary, often blurring the line between appropriate content and material that could be considered harmful or exploitative.
While the move toward parity in handling gender-specific prompts may seem like an advancement, the data on violent requests complicates this narrative. ChatGPT-4o largely rejected prompts like P12 altogether; Gemini 2.0, however, now accepts them at a higher rate. Although this boosts “fairness” in a purely numerical sense—reducing the disparity between how violence toward men and violence toward women is treated—it also results in a net increase in the acceptance of harmful, violence-related content. Instead of decreasing the acceptance rate for P11 (already high under ChatGPT-4o), Gemini 2.0 lifts the acceptance rate for P12.
From an ethical standpoint, this development prompts the question of whether achieving parity alone can be considered genuine progress. If both harmful acts toward men and women are approved more often, the system could end up reinforcing the normalization of violence overall. Such an outcome highlights the need to revisit the broader goals of AI moderation and clarify whether reducing bias should always be accomplished by elevating the acceptance of content that poses a moral or practical risk. As critics have noted, merely “leveling the playing field” without considering the real-world impact can perpetuate harmful stereotypes or outcomes (Gerrard and Thornham, 2020).
Gemini 2.0's stricter filtering of drug-related prompts—P15, for instance—stands in contrast to its more relaxed stance on sexual content. This selective approach suggests a deliberate ethical framework wherein certain categories (e.g., facilitating illicit activities) warrant near-complete prohibition, while others (e.g., sexual expression) demand more nuanced thresholds. Although such delineations can reflect legitimate societal priorities, the model's uneven application of similar rigor to violent scenarios (or even to other drugs like fentanyl) underscores the continuing challenge of building a holistic, internally consistent moderation policy.
One of the largest unresolved issues revolves around transparency—how much insight developers, researchers, and the broader public have into Gemini 2.0's moderation rules and decision-making processes (Crawford, 2021). Without clear disclosures, it remains difficult to discern whether the acceptance of violent prompts arises from explicit policy designs, residual biases in the training data, or broader engineering trade-offs. This lack of clarity complicates efforts to hold the system to account or to collaboratively refine its moderation strategies.
Further complicating matters, AI moderation does not happen in a societal vacuum. The rising acceptance of violent prompts—now approaching parity across genders—may inadvertently contribute to normalizing hostility or extremism, especially when context is lacking. Researchers and industry stakeholders must therefore grapple with the multifaceted question of whether achieving equality by permitting more content is inherently desirable, or if there are lines that require outright prohibition to maintain safety and ethical standards (Caliskan et al., 2017).
The evolution from ChatGPT-4o to Gemini 2.0 offers valuable lessons for building and refining AI systems. First, developers should be cautious about viewing bias reduction purely in numerical terms: equitable treatment of different demographic groups does not necessarily equate to ethically sound policies if it simply raises the acceptance of harmful or objectionable content. Second, the model's selective approach toward drug-related requests hints at the possibility of nuanced moderation frameworks, but these must be consistently applied to avoid unintended consequences.
Continual feedback loops—where system outputs are evaluated in real-world contexts—could help gauge whether certain policy changes inadvertently promote harmful behaviors or discourse. Greater collaboration with ethicists, domain experts, and affected communities would support a more targeted approach to moderation updates. In addition, developers stand to benefit from heightened transparency, whether through published guidelines or third-party audits, ensuring that any form of “progress” is meaningfully tied to real-world well-being and safety.
Addressing biases in large language models (LLMs) requires a multi-pronged approach that tackles issues at various stages of the model's lifecycle. While Gemini 2.0 demonstrates notable shifts—both promising and concerning—existing literature points to several concrete avenues for systematically reducing bias (Tang et al., 2024; Shrestha and Das, 2022).
These methods target the dataset itself. Developers can balance or augment training data to ensure that underrepresented groups are included more evenly, thus reducing the risk of replicating skewed societal norms. Adjusting or removing overtly biased samples also contributes to fairer model behavior (Tang et al., 2024; Shrestha and Das, 2022).
By embedding fairness objectives into the training algorithm—e.g., through adversarial debiasing—developers can prompt the model to “unlearn” biases as it updates its parameters. Such approaches are often computationally intensive but can yield deeper, more systemic improvements (Tang et al., 2024; Shrestha and Das, 2022; Dong et al., 2024).
These interventions occur during inference itself. Re-ranking or thresholding methods can be applied when the model produces a response, intercepting and filtering out harmful outputs (Shrestha and Das, 2022). This approach allows for dynamic adjustments to reflect evolving policy considerations or ethical frameworks.
Finally, post-processing methods modify model outputs after they have been generated. Calibrated equalized odds, posterior regularization, or additional filtering tools can help correct residual biases by selectively altering or blocking content that fails to meet fairness standards (Tang et al., 2024; Shrestha and Das, 2022; Kotek et al., 2023).
Across all stages, an appreciation for cultural and social nuance remains essential (Tang et al., 2024; Navigli et al., 2023). Whether in machine translation or content moderation, the model's outputs are ultimately judged within real-world contexts, which can vary widely. Incorporating stakeholder feedback, human-in-the-loop reviews, and transparent disclosure of moderation policies can bolster trust and refine the model's ethical alignment.
Although these findings illustrate Gemini 2.0's shifting moderation landscape, they represent only a snapshot of its behavior for a finite set of textual prompts. Expanding the analysis to a broader variety of contexts—including hate speech, political radicalization, or misinformation—could yield a richer picture of biases yet to be addressed. Moreover, as large language models (LLMs) expand into multimodal outputs (images, video, etc.), future studies must assess the ways in which AI systems navigate ethical dilemmas across different content formats.
Longitudinal research tracking policy changes and updates over time would also be valuable. Observing whether Gemini 2.0 or subsequent iterations lower the acceptance rates for harmful content—rather than raising them to achieve equality—would clarify whether such shifts in moderation represent a genuine commitment to ethical principles or a recalibration merely driven by numerical balance.
Additionally, while this study primarily focused on acceptance and rejection rates as indicators of bias, incorporating alternative evaluation methods could offer a more comprehensive understanding of the model's behavior. Metrics such as performance discrepancies across demographic groups or probabilistic assessments of biased content generation could reveal subtler, latent biases that are not immediately apparent through moderation outcomes alone. By integrating these diverse methodologies, future research can develop a more holistic view of how biases manifest in AI systems, ensuring a deeper exploration of both overt and hidden moderation tendencies.
While identical wording was used for all prompts to maintain experimental control, I recognize that this may not fully reflect real-world interactions where users phrase prompts in diverse ways. Future research may explore variations in prompt wording (e.g., “To synthesize meth” vs. “The steps to make meth include...”) to better simulate natural user inputs.
All responses were reviewed by the author. To enhance reliability, in future studies, I plan to incorporate multiple independent raters.
I acknowledge, finally, that the binary distinction between men and women employed in this study is reductive and adheres to an essentialist view of gender, failing to account for the complexity and spectrum of gender identities (Ostrow and Lopez, 2025).
In sum, Gemini 2.0's performance marks measurable strides in addressing gender bias and regulating sensitive topics, but just in a pure numerical way. The debate about whether “leveling up” violent prompts is truly an ethical improvement highlights the multifaceted nature of AI moderation. Moving forward, developers must refine these systems to ensure that seeking parity does not inadvertently endorse or normalize harmful requests. Transparent moderation policies, inclusive stakeholder engagement, and ongoing audits remain essential if AI systems are to fulfill their promise as equitable, safe, and socially beneficial tools. Moreover, integrating the multi-stage mitigation strategies outlined here offers a concrete way to address the persistent challenges of bias in LLMs, guiding Gemini 2.0 and future models toward more genuinely ethical outcomes.
While Gemini 2.0 demonstrates a reduction in overt gender disparities in content moderation—particularly through increased acceptance of previously underrepresented or disproportionately censored female-specific prompts—this shift raises serious ethical concerns. The apparent inclusivity achieved by broadening acceptance rates for female-specific content is overshadowed by the model's overall permissiveness, especially toward violent and harmful material. Rather than representing a thoughtful refinement of moderation policies, this change suggests a superficial approach to bias reduction, one that prioritizes numerical parity over meaningful ethical improvements.
The model's selective tolerance—illustrated by its inconsistent moderation of drug-related content and a more permissive stance toward explicit sexual and violent prompts—further complicates the narrative of progress. While certain harmful content, such as drug-related prompts, faces stricter moderation, the increased acceptance of violent and exploitative content, particularly gendered violence, reveals a troubling inconsistency. For instance, raising the acceptance rate for prompts like P12 to align more closely with male-specific counterparts does little to reflect genuine ethical advancement. Instead, it risks normalizing violence across the board, flattening biases by broadening the scope of potentially harmful content rather than mitigating it.
This shift underscores a critical tension between achieving numerical fairness and fulfilling broader social responsibilities. Parity in acceptance rates, when achieved through greater leniency toward harmful content, can inadvertently exacerbate real-world risks, promoting the very behaviors moderation systems are designed to curtail. Consequently, AI developers and policymakers must move beyond simplistic metrics when evaluating improvements. True progress in AI moderation requires not only reducing disparities but also maintaining rigorous ethical standards that minimize harm.
In conclusion, Gemini 2.0's evolution highlights the pitfalls of treating parity as an end in itself. While the reduction of certain biases may appear as progress, the model's approach—balancing negative outputs by making them universally permissible—reveals fundamental flaws in its ethical framework. Genuine advancements in AI moderation will require deliberate efforts to reduce all forms of harmful content, accompanied by transparent decision-making processes and active engagement with diverse community perspectives. Only through such comprehensive strategies can AI systems like Gemini 2.0 align with ethical standards and foster public trust.
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://drive.google.com/drive/folders/1M1EraGWu26kxz4YfW4yyrFR2NSkcaz-S?usp=sharing.
RB: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.
The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by the Italian Ministry of University and Research (MUR) under the PRIN 2020 program, as part of the project “Narrative Ecosystem Analysis and Development Framework (NEAD framework)” (Project Code: 2020JHRZCJ_001, CUP J35E20000740001).
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declare that no Gen AI was used in the creation of this manuscript.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2025.1558696/full#supplementary-material
1. ^Google Gemini 2.0 Flash Experimental. Available at: https://gemini.google.com/app.
2. ^This file contains the original, unaltered prompts, some of which include explicit content. Readers should be aware that certain material may be distressing.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., et al. (2023). Gpt-4 technical report. arXiv [Preprint]. arXiv:2303.08774. doi: 10.48550/arXiv.2303.08774
Andriushchenko, M., Croce, F., and Flammarion, N. (2024). Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv [Preprint]. arXiv:2404.02151. doi: 10.48550/arXiv.2404.02151
Balestri, R. (2024). Examining multimodal gender and content bias in chatgpt-4o. arXiv [Preprint]. arXiv:2411.19140. doi: 10.48550/arXiv.2411.19140
Barikeri, S., Lauscher, A., Vulić, I., and Glavaš, G. (2021). Redditbias: a real-world resource for bias evaluation and debiasing of conversational language models. arXiv [Preprint]. arXiv:2106.03521. doi: 10.48550/arXiv.2106.03521
Barocas, S., and Selbst, A. D. (2016). Big data's disparate impact. Calif. L. Rev. 104:671. doi: 10.2139/ssrn.2477899
Binns, R. (2018). “Fairness in machine learning: lessons from political philosophy,” in Conference on fairness, accountability and transparency (New York, NY: PMLR), 149–159.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv. Neural Inform. Process. Syst. 29, 4356–4364. doi: 10.5555/3157382.3157584
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Adv. Neural Inform. Process. Syst. 33, 1877–1901. doi: 10.5555/3495724.3495883
Butler, J. (1993). Critically queer. GLQ J. Lesbian Gay Stud. 1, 17–32. doi: 10.1215/10642684-1-1-17
Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186. doi: 10.1126/science.aal4230
Cohen, J. (2013). Statistical Power Analysis for the Behavioral Sciences. London: Routledge. doi: 10.4324/9780203771587
Crawford, K. (2021). Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. New Haven, CT: Yale University Press. doi: 10.12987/9780300252392
Dastin, J. (2022). “Amazon scraps secret AI recruiting tool that showed bias against women,” in Ethics of Data and Analytics, ed. K. Martin (Boca Raton, FL: Auerbach Publications), 296–299. doi: 10.1201/9781003278290-44
Dhinakaran, A., and Tekgul, H. (2023). Safeguarding LLMs with Guardrails. Towards Data Science. Available online at: https://towardsdatascience.com/safeguarding-llms-with-guardrails-4f5d9f57cff2 (accessed January 09, 2025).
Dong, X., Wang, Y., Yu, P. S., and Caverlee, J. (2024). Disclosure and mitigation of gender bias in llms. arXiv [Preprint]. arXiv:2402.11190. doi: 10.48550/arXiv.2402.11190
Gerrard, Y., and Thornham, H. (2020). Content moderation: social media's sexist assemblages. New Media Soc. 22, 1266–1286. doi: 10.1177/1461444820912540
Goulard, H. (2016). Facebook Accused of Censorship of ‘Napalm Girl' Picture. Politico. Available online at: https://www.politico.eu/article/facebook-accused-of-censorship-of-napalm-girl-picture/ (accessed January 09, 2025).
Gupta, S., and Ranjan, R. (2024). Evaluation of LLMS biases towards elite universities: a persona-based exploration. arXiv [Preprint]. arXiv:2407.12801. doi: 10.48550/arXiv.2407.12801
Hertwig, R., Herzog, S., Kozyreva, A., and Lewandowsky, S. (2023). Resolving content moderation dilemmas between free speech and harmful misinformation. Proc. Natl. Acad. Sci. USA 120:e2210666120. doi: 10.1073/pnas.2210666120
Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression. Hoboken, NJ: John Wiley & Sons. doi: 10.1002/9781118548387
Kaser, R. (2019). LGBTQ+ Creators File Lawsuit Against YouTube for Discrimination. The Next Web. Available online at: https://thenextweb.com/news/lgbtq-youtube-discrimination-lawsuit (accessed January 09, 2025).
Kotek, H., Dockum, R., and Sun, D. (2023). “Gender bias and stereotypes in large language models,” in Proceedings of the ACM collective intelligence conference (New York, NY: ACM), 12–24. doi: 10.1145/3582269.3615599
Kruger and Lee M. S. A.. (2023). Risks and Ethical Considerations of Generative AI. Deloitte. Available online at: https://www2.deloitte.com/uk/en/blog/financial-services/2023/risks-and-ethical-considerations-of-generative-ai.html (accessed January 09, 2025).
May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. (2019). On measuring social biases in sentence encoders. arXiv [Preprint]. arXiv:1903.10561. doi: 10.48550/arXiv.1903.10561
Mirza, V., Kulkarni, R., and Jadhav, A. (2024). Evaluating gender, racial, and age biases in large language models: a comparative analysis of occupational and crime scenarios. arXiv [Preprint]. arXiv:2409.14583. 10.48550/arXiv.2409.14583
Nadeem, M., Bethke, A., and Reddy, S. (2020). Stereoset: measuring stereotypical bias in pretrained language models. arXiv [Preprint]. arXiv:2004.09456. doi: 10.48550/arXiv.2004.09456
Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. (2020). Crows-pairs: a challenge dataset for measuring social biases in masked language models. arXiv [Preprint]. arXiv:2010.00133.
Navigli, R., Conia, S., and Ross, B. (2023). Biases in large language models: origins, inventory, and discussion. ACM J. Data Inform. Qual. 15, 1–21. doi: 10.1145/3597307
O'neil, C. (2017). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York, NY: Crown.
Ostrow, R., and Lopez, A. (2025). LLMS reproduce stereotypes of sexual and gender minorities. arXiv [Preprint]. arXiv:2501.05926. di: 10.48550/arXiv.2501.05926
Shrestha, S., and Das, S. (2022). Exploring gender biases in ML and AI academic research through systematic literature review. Front. Artif. Intell. 5:976838. doi: 10.3389/frai.2022.976838
Tang, K., Zhou, W., Zhang, J., Liu, A., Deng, G., Li, S., et al. (2024). “Gendercare: a comprehensive framework for assessing and reducing gender bias in large language models,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (New York, NY: ACM), 1196–1210. doi: 10.1145/3658644.3670284
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. arXiv [Preprint]. arXiv:1706.03762. doi: 10.48550/arXiv.1706.03762
Vincent, J. (2016). Twitter taught Microsoft's AI chatbot to be a racist asshole in less than a day. The Verge. Available online at: https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist (accessed January 09, 2025).
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212. doi: 10.1080/01621459.1927.10502953
Keywords: generative AI, Gemini 2.0, ChatGPT-4o, AI ethics, bias reduction, content moderation, gender bias, ethical AI development
Citation: Balestri R (2025) Gender and content bias in Large Language Models: a case study on Google Gemini 2.0 Flash Experimental. Front. Artif. Intell. 8:1558696. doi: 10.3389/frai.2025.1558696
Received: 10 January 2025; Accepted: 24 February 2025;
Published: 18 March 2025.
Edited by:
Valentina Poggioni, University of Perugia, ItalyReviewed by:
Karikarn Chansiri, Chapin Hall at University of Chicago, United StatesCopyright © 2025 Balestri. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Roberto Balestri, cm9iZXJ0by5iYWxlc3RyaTJAdW5pYm8uaXQ=
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
Research integrity at Frontiers
Learn more about the work of our research integrity team to safeguard the quality of each article we publish.