By CDT Intern Farhana Shahid
Around 75% of Internet users are from non-English-speaking countries in the Majority World (i.e., the Global South). Yet social media companies allocate most of their content moderation resources to English-speaking populations in the West. This disparity in platforms’ content moderation efforts has led to human rights violations and unjust moderation outcomes in the Majority World. To fill this critical gap, researchers from these regions have focused on improving automated detection of harmful content in local languages, which are often underrepresented in digital spaces and lack robust technological support.
To better understand the challenges researchers in the Majority World face while addressing online harms, we interviewed 12 researchers specializing in three low-resource languages: Tamil, spoken in South Asia; Kiswahili, in East and Central Africa; and Quechua, in South America. These researchers use Natural Language Processing (NLP) to improve computers’ understanding of low-resource languages, focusing on the detection of content that needs to be moderated, such as misinformation, hate speech, and spam.
Our investigation reveals a troubling trend: tech companies are withholding crucial data from the researchers, hindering the development of automated content moderation technologies for low-resource languages. This is compounded by colonial biases in NLP research, which impede effective moderation of harmful content in non-English contexts, affecting the whole pipeline, from data curation and annotation to training AI models capable of understanding local dialects. However, the NLP researchers we interviewed believe that the status quo could be improved through partnerships between researchers in the Majority World, who have a better understanding of the socio-cultural nuances of online harms in their regions, and social media giants, who control the resources needed to make meaningful change.
The NLP researchers working in Tamil, Kiswahili, and Quechua stated that the biggest roadblock to addressing online harms is the lack of high-quality digital data. This stems from the colonial legacy in NLP, which favors digital inclusion of English and a handful of European languages while neglecting linguistic diversity in the Majority World. The researchers shared that they rely on user-generated content on social media, which is often the only digital data they can find online. However, they pointed out that even the limited data access tech companies offered in the past was barely enough to train AI models for low-resource languages.
African NLP researchers working on Kiswahili complained that tech companies often denied them access to data if they did not have prior publications, a requirement that is hard to meet because they lack funding to support and publish their work on low-resource languages. Things worsened when tech companies began charging exorbitant fees for data access, axed researchers’ access to existing data, and blocked the open-source tools that independent researchers used to scrape online content. The researchers stressed that this put them in a difficult spot: they cannot afford the high fees, so they are left with no data access at all.
This manifestation of the resource gap is deeply rooted in the colonial legacy, which prioritizes strengthening Western institutions as knowledge-producers and solution-makers for global problems rather than building local research capacity. Tech companies, many of which are based in the West, exacerbate the gap by gatekeeping and monetizing user-generated content in low-resource languages.
In response to these challenges, NLP researchers devised creative, community-led processes to gather data. Tamil NLP researchers initiated voluntary data donation by WhatsApp users to study misinformation in India. Due to the lack of sufficient digital text in Quechua, researchers collaborated with native speakers, who donated their speech data and helped with manual transcription. However, without funding, the researchers could not fairly compensate community members for their contributions, leaving them unable to sustain such elaborate data collection processes.
Despite community-wide interest in developing automated moderation technologies for non-English content, progress has been slow due to the lack of high-end computing hardware. Although many researchers have relied on Google Colab’s free computing resources, they argued that the allotted time and memory are insufficient to effectively train language models with billions of parameters. They also discussed the difficulties of working with existing AI models, which are tailored to data-rich languages and do not transfer well to low-resource languages.
Historically, these biases have led to poor digital support for languages with non-Latin scripts. This has forced people in the Majority World to use the Latin alphabet to write their languages, resulting in widespread code-mixing (i.e., the combination of two or more languages in the same text). Code-mixed texts in Tanglish (Tamil-English), Sheng (Kiswahili-English), and Quechuañol (Quechua-Spanish) are ubiquitous online. However, the researchers pointed out that existing AI models fail to capture this linguistic phenomenon in the Majority World. Because these models are primarily trained on English, they perform poorly on code-mixed texts and struggle to identify similarities and relationships among romanized words from different languages.
The NLP researchers discouraged translating code-mixed texts to train AI models because inaccurate translations of low-resource languages would introduce biases into tasks such as hate speech detection. Although multilingual language models can handle code-mixing to some extent, they still perform worse on low-resource languages than on high-resource ones. The researchers explained that most AI models break sentences into tokens based on how frequently certain symbols or words appear together. This works well for languages with ample data, but in low-resource and code-mixed text each word form is comparatively rare, so it gets split into fragments that carry little meaning (the sketch below illustrates this). They also reported that because English has relatively simple word structures (i.e., it is morphologically poor), multilingual language models draw erroneous connections when applied to languages like Tamil, Kiswahili, and Quechua, which have more complex word formations (i.e., they are morphologically rich). Some of these researchers found that tokenizing words based on their morphological structure generates better results for data-scarce, morphologically rich languages.
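To make the tokenization problem concrete, here is a minimal sketch (our illustration, not code from the interviewed researchers) that runs an English-trained subword tokenizer over an English sentence, a Kiswahili sentence, and a romanized Tanglish sentence. It assumes the open-source Hugging Face transformers library and the public bert-base-uncased checkpoint; the exact fragments vary by model, but the pattern the researchers described, whole tokens for English versus shattered subword pieces for the others, holds broadly.

```python
# Minimal sketch: how an English-centric subword tokenizer handles
# code-mixed and low-resource text. Assumes the Hugging Face
# `transformers` library and the public "bert-base-uncased" checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

examples = {
    "English": "The children are playing outside",
    "Kiswahili": "Watoto wanacheza nje",  # "The children are playing outside"
    "Tanglish": "Inniki weather romba nalla irukku",  # code-mixed: "The weather is very nice today"
}

for label, sentence in examples.items():
    # The English sentence maps to a few whole-word tokens; the Kiswahili and
    # code-mixed sentences shatter into many '##' subword fragments that
    # carry little meaning on their own.
    print(f"{label:>10}: {tokenizer.tokenize(sentence)}")
```

Because vocabularies like this one are learned from mostly English text, frequency-based splitting penalizes exactly the languages the interviewed researchers work on, which is why some of them turned to morphology-aware tokenization instead.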
Additionally, the default assumptions in current AI pipelines do not account for the cultural nuances of hate speech in the Majority World. Because most NLP tools mishandle emojis, earlier NLP studies on English removed them during data preprocessing, and this practice became standard. However, the Tamil NLP researchers we interviewed found that such removal worsens hate speech detection, because local people use algospeak (a mix of letters, emojis, or special characters in place of actual words) to evade moderation while spreading communal hate speech. This prompted the researchers to integrate the context of emojis into hate speech detection, as sketched below. Another example involves the use of sentiment analysis to automatically annotate datasets, which overlooks religious and ethnic hate speech that uses positive sentiment to promote supremacist ideologies. Hence, the researchers suggested involving community members targeted by hate speech in the annotation process to ensure the annotation guidelines accurately reflect the experiences of the affected community.
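As a rough illustration of why stripping emojis discards moderation signal, the sketch below contrasts the conventional remove-emojis preprocessing step with an alternative that converts each emoji into a text token a classifier can learn from. It uses the open-source Python emoji package; the example post is hypothetical and written in English for readability, and this is our sketch rather than the researchers’ actual pipeline.

```python
# Sketch: two ways to preprocess a post before hate speech classification.
# Assumes the open-source `emoji` package (pip install emoji).
import emoji

# Hypothetical algospeak post: emojis stand in for dehumanizing words.
post = "they should all go back 🐀🔥"

# Conventional step (standardized in English-centric studies): strip emojis.
stripped = emoji.replace_emoji(post, replace="")
print(stripped)  # "they should all go back " -- the harmful signal is gone

# Alternative: turn each emoji into a text token the model can learn from.
kept = emoji.demojize(post, delimiters=(" :", ": "))
print(kept)      # "they should all go back  :rat:  :fire: "
```

Preserving emoji context this way is only one simple option; the broader point from the interviews is that preprocessing defaults inherited from English-language studies need to be revisited against each community’s actual usage.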
Our discussions highlight the systemic biases and inequities that impede content moderation research in the Majority World. Social media companies have abundant data and extensive computing power, yet they fail to mitigate online harms in these regions. These companies perpetuate data colonialism by profiting from data generated by unpaid user labor while denying data access to non-Western researchers, further entrenching global inequalities in digital content moderation. Hence, the researchers called for fairer access to resources and a shift in power to support grassroots efforts to combat harmful content.
The ongoing power asymmetry in content moderation means that low-wage moderation tasks are outsourced to the Majority World, while the more prestigious work of developing content moderation technologies remains in the hands of Silicon Valley engineers. To challenge this colonial division of labor and improve moderation technologies for low-resource languages, companies need to collaborate with NLP researchers in the Majority World, who have the necessary linguistic and cultural expertise. However, the researchers insisted that companies must not appropriate their work or exploit the volunteer labor of community members, who play an integral part in data curation and annotation.
Because researchers in the Majority World are resource-strapped, tech companies should level the playing field by waiving data access fees and dropping the prior-publication requirement for these researchers. They should establish transparent data request procedures and implement safeguards for the responsible use of user data from these regions. Tech companies like Google could support researchers by providing additional access to computing resources on Colab, and they could offer research grants to build local capacity for detecting harmful content in low-resource languages. This would enable researchers to remunerate community members who contribute to different stages of the moderation pipeline.
Content moderation is hard, but for low-resource languages, it is an even steeper uphill battle. However, tech companies and local researchers working in non-Western contexts can complement each other’s efforts to promote community-centric and language-aware approaches to moderation in the Majority World.