The New ChatGPT Has a Huge Problem in Chinese

Dirty Data

A pollution problem in OpenAI's training data has left its new chatbot's Chinese outputs chock-full of porn and spam, MIT Technology Review reports.

Last week, OpenAI released GPT-4o, a decidedly flirty new large language model (LLM) equipped with new capabilities, including the ability to "see" through users' device cameras and to converse out loud in real time. But for all of GPT-4o's apparent advancements, it seems to have at least one massive blind spot: the Chinese language.

To train AI models, you need tokens: small units of text, such as words, word fragments, or characters, that an AI uses to "read" and learn. According to MIT Tech, AI researchers were quick to discover that nearly all of the 100 longest Chinese-language tokens used by the AI to decipher Chinese prompts consisted of spammy porn and gambling content, resulting in bizarre, smut- and spam-ridden responses to completely run-of-the-mill queries.
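To make the researchers' check concrete, here is a minimal Python sketch of how someone might pull the longest Chinese-language tokens out of a tokenizer's vocabulary. The vocabulary below is a made-up toy, not GPT-4o's real token list, and the function names are hypothetical. It also only looks at one Unicode block, so it is a rough filter rather than a complete test for Chinese text.

```python
# Toy vocabulary: token id -> token string. These entries are invented for
# illustration and are NOT taken from any real tokenizer.
TOY_VOCAB = {
    0: "the",
    1: " hello",
    2: "你好",            # "hello"
    3: "谢谢",            # "thank you"
    4: "免费在线观看",    # "watch online for free", a typical spam phrase
    5: "每日彩票",        # "daily lottery"
    6: "ing",
    7: "天气",            # "weather"
}


def contains_cjk(text):
    """Return True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def longest_chinese_tokens(vocab, n):
    """Return the n longest tokens that contain Chinese characters, longest first."""
    chinese = [tok for tok in vocab.values() if contains_cjk(tok)]
    # sorted() is stable, so tokens of equal length keep their vocabulary order.
    return sorted(chinese, key=len, reverse=True)[:n]


print(longest_chinese_tokens(TOY_VOCAB, 3))
```

On a real vocabulary with hundreds of thousands of entries, the same idea applies: long tokens exist only because that exact string appeared very often in the tokenizer's training text, which is why the longest ones are a quick window into what that text contained.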

"This is sort of ridiculous," Tianle Cai, an AI researcher and PhD candidate at Princeton, wrote in a GitHub post showcasing the polluted tokens.

Unforced Error

The worst part? According to experts, the problem of uncleaned data is a well-known AI training hurdle — and likely wouldn't have been too hard to fix.

"Every spam problem has a solution," Deedy Das, an AI investor at Menlo Ventures who formerly worked on Google's Search team, told MIT Tech, adding that just auto-translating tokenized content to detect certain problematic keywords could feasibly "get you 60 percent of the way" to a clean dataset.

"At the end of the day," he continued, "I just don't think they did the work in this case."

"The English tokens seem fine," Cai, the Princeton researcher, told MIT Tech, "but the Chinese ones are not."

In other words, the likeliest reason for OpenAI's error is that ensuring its Chinese-language tokens were mostly free of porn and gambling spam just didn't make the to-do list.
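Das's suggestion of auto-translating tokens and scanning for problem keywords can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `translate` function is a hypothetical stand-in backed by a made-up lookup table (a real pipeline would call a translation model or service), and the keyword list is invented, not a vetted blocklist.

```python
# Hypothetical stand-in for a machine-translation step. The table is made up;
# a real pipeline would call an actual translation model or service here.
TOY_TRANSLATIONS = {
    "免费在线观看": "watch online for free",
    "每日彩票": "daily lottery",
    "你好": "hello",
    "天气": "weather",
}

# Illustrative keywords only; these are assumptions, not a vetted blocklist.
SPAM_KEYWORDS = {"lottery", "casino", "free", "porn"}


def translate(token):
    """Return an English gloss for a token, or an empty string if unknown."""
    return TOY_TRANSLATIONS.get(token, "")


def flag_suspect_tokens(tokens, keywords):
    """Return the tokens whose translation contains at least one keyword."""
    flagged = []
    for tok in tokens:
        words = set(translate(tok).lower().split())
        # A non-empty intersection means at least one keyword matched.
        if words & keywords:
            flagged.append(tok)
    return flagged


print(flag_suspect_tokens(["你好", "每日彩票", "天气", "免费在线观看"], SPAM_KEYWORDS))
```

A filter this crude misses anything the translator cannot handle and anything phrased without the listed words, which fits Das's estimate that it gets you only part of the way to a clean dataset.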

It's a bad look for OpenAI. Chinese has the most native speakers of any language on the planet. And numbers aside, if the future of our internet will indeed center on AI-generated material, as opposed to human-created and built websites, communities, and worlds, then errors like failing to ensure that a premier chatbot can parse the native language of over one billion humans mean that people, not to mention entire cultures, inherently get left out.

That is to say, let's hope this is a learning moment.
