The Escalating Challenge of Online Hate Speech

In an increasingly interconnected world, hate speech has shifted from in-person interactions to the vast and often anonymous digital realm. This transition has enabled hateful content to spread with unprecedented speed and reach, amplified by social media platforms. On June 18, as the United Nations observed the International Day for Countering Hate Speech, UN Secretary-General António Guterres highlighted the critical role social platforms play in exacerbating this threat. With artificial intelligence (AI) increasingly deployed to detect and remove such content, a closer examination reveals the limitations of AI systems and the ways their verdicts diverge from human judgment.

Defining Hate Speech in the Digital Age

The United Nations defines hate speech broadly as any form of communication, whether spoken, written, or behavioral, that discriminates against or incites violence toward an individual or group on the basis of actual or perceived identity, including race, ethnicity, religion, gender, sexual orientation, or disability. Critically, hate speech is not confined to verbal expressions; it can also take the form of images, cartoons, gestures, and even objects, which underscores how many forms the problem can take.

Prevalence and Impact of Online Hate Speech

A joint survey conducted in 2023 by Ipsos and UNESCO, involving 8,000 participants across 16 countries, revealed the alarming prevalence of online hate speech. Over two-thirds of internet users reported encountering hate speech online. The survey also indicated that LGBTQI individuals were perceived to be the most frequent targets, cited by 33 percent of respondents, followed by ethnic and racial minorities (28 percent) and women (18 percent).

Social media companies exhibit varying approaches and effectiveness in combating this issue. Meta, the parent company of Facebook and Instagram, has shown a notable decrease in the proactive removal of hateful posts since 2023. In the final quarter of 2025, Meta removed approximately 1.3 million posts from both Instagram and Facebook, a significant reduction compared to the 7.4 million from Instagram and 5.8 million from Facebook in the same period of 2024. This shift reflects a move away from proactive detection towards greater reliance on user reports for identifying hate speech. In contrast, TikTok reported a more proactive stance, claiming to have removed 96.3 percent of all hate speech and related content in the fourth quarter of 2025 before it was even reported by users.

AI's Role and Its Inconsistencies in Detection

In the effort to combat the widespread dissemination of hate speech, social media platforms have increasingly integrated AI-powered content moderation systems, primarily utilizing large language models (LLMs). These systems are designed to automate content filtering across vast quantities of messages by employing labeled datasets and pretrained language models to identify abusive language. They then apply predefined rules or score thresholds to determine whether content constitutes hate speech or violates platform policies.
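The thresholding step described above can be illustrated with a minimal Python sketch. It assumes a classifier has already produced a score between 0 and 1 for a piece of content. The names `HATE_THRESHOLD`, `Decision`, and `moderate`, as well as the cut-off value of 0.5, are hypothetical and chosen only for illustration; they do not describe any platform's actual system.

```python
from dataclasses import dataclass

# Hypothetical cut-off chosen for illustration; platforms do not
# generally publish the thresholds they use.
HATE_THRESHOLD = 0.5


@dataclass
class Decision:
    text: str
    score: float
    flagged: bool


def moderate(text: str, score: float, threshold: float = HATE_THRESHOLD) -> Decision:
    """Flag content when the classifier score meets or exceeds the threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return Decision(text=text, score=score, flagged=score >= threshold)
```

Under this kind of rule, the same score can lead to different outcomes depending on where the threshold is set: a post scored at 0.7 is flagged with a threshold of 0.5 but passes with a threshold of 0.8.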

However, a 2025 study by researchers at the University of Pennsylvania highlighted significant inconsistencies and biases in how these AI moderation systems identify and classify hate speech. The study evaluated seven prominent AI moderation systems, including those from OpenAI, Anthropic, DeepSeek, Mistral, and Google. It uncovered considerable differences in how these models scored the severity of hate speech across various demographic groups, raising concerns about potential unequal protection online.

“If two systems produce different outcomes for the same piece of content – flagging it as hate speech in one case but not in another – it undermines the legitimacy of the moderation process.”

For instance, the Mistral Moderation Endpoint frequently assigned very high scores, close to 1, indicating that it labeled many examples as highly hateful irrespective of the target group. Conversely, the OpenAI Moderation Endpoint often produced much lower scores for numerous categories, sometimes less than half of what other models assigned. Such discrepancies, as the study authors noted, undermine the credibility and fairness of the moderation process.
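A short Python sketch can show how scoring differences of this kind turn into conflicting verdicts. The scores below are invented for illustration and are not taken from the study; the system names, the helper functions `decisions` and `disagree`, and the shared threshold of 0.5 are likewise assumptions.

```python
def decisions(scores: dict[str, float], threshold: float = 0.5) -> dict[str, bool]:
    """Map each system's score to a flag/no-flag verdict at a shared threshold."""
    return {name: score >= threshold for name, score in scores.items()}


def disagree(scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Return True when the systems do not all reach the same verdict."""
    return len(set(decisions(scores, threshold).values())) > 1


# Invented scores for one piece of content: one system scores close to 1,
# the other assigns less than half of that.
example_scores = {"system_a": 0.95, "system_b": 0.40}

print(decisions(example_scores))
print(disagree(example_scores))
```

With these invented numbers, the first system flags the content and the second does not, which is the situation the study authors describe as undermining the legitimacy of the moderation process.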

The Nuances AI Models Often Miss

While AI systems can effectively detect explicit hate speech, particularly when profanities and slurs are directly used against a specific group, they frequently falter with more nuanced forms of expression. Arkaitz Zubiaga, an associate professor at Queen Mary University of London and co-lead of the university’s Social Data Science lab, explained this challenge to Al Jazeera.

“One challenging example is the case of implicit hate speech, which is often not detected as such because it contains no mention of slurs,” Zubiaga stated. He elaborated on instances where a positive-sounding opening, such as “I would love to see how great the world would be if…” is followed by a derogatory message targeting a demographic group. AI systems, by focusing on the seemingly positive initial phrasing, can struggle to recognize the underlying hateful intent.

Conversely, AI models also tend to misclassify reclaimed language as hate speech. Zubiaga highlighted how historically offensive terms, once embraced and repurposed by marginalized communities, are used among their members as terms of endearment. Despite this shift in meaning, AI systems often flag such instances as hateful, failing to grasp how the terms have evolved and the specific context in which a community uses them. This illustrates a significant gap in AI's ability to understand the complex, evolving nature of human language and its cultural nuances.

Source: Al Jazeera