The New ChatGPT Has a Huge Problem in Chinese

Maggie Harrison Dupré

May 20, 2024 at 3:32 p.m.

Dirty Data

A pollution problem with OpenAI training data has rendered its new chatbot's Chinese outputs chock-full of porn and spam, the MIT Technology Review reports.

Last week, OpenAI released GPT-4o, a decidedly flirty new large language model (LLM) equipped with new and advanced capabilities, including the ability to "see" through users' device cameras and the power to converse out loud in real time. But for all of GPT-4o's apparent advancements, it seems to have at least one massive blind spot: the Chinese language.

To train AI models, you need tokens, the units of data that a model uses to "read" and learn. According to MIT Tech, AI researchers were quick to discover that nearly all of the 100 longest Chinese-language tokens the AI uses to decipher Chinese prompts consisted of spammy porn and gambling content, resulting in bizarre, smut- and spam-ridden responses to completely run-of-the-mill queries.

"This is sort of ridiculous," Tianle Cai, an AI researcher and PhD candidate at Princeton, wrote in a GitHub post showcasing the polluted tokens.
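For readers curious what that kind of inspection involves, here is a minimal sketch of ranking a tokenizer vocabulary's Chinese tokens by length. It is not Cai's code: the function names and the toy vocabulary are invented for illustration, and the entries are not GPT-4o's actual tokens. A real check would presumably load the model's published vocabulary in place of the toy list.

```python
def is_cjk(ch: str) -> bool:
    """True if ch falls in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab: list[bytes], top_n: int = 100) -> list[str]:
    """Return the top_n longest tokens that decode to all-Chinese text."""
    decoded = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Byte-level vocabularies contain fragments of characters; skip them.
            continue
        text = text.strip()
        if text and all(is_cjk(ch) for ch in text):
            decoded.append(text)
    # Longest first; sorted() is stable, so ties keep vocabulary order.
    return sorted(decoded, key=len, reverse=True)[:top_n]


# Toy vocabulary, invented for illustration (hypothetical, not real GPT-4o tokens).
toy_vocab = [
    b"hello",
    "天气".encode("utf-8"),
    "免费视频在线观看".encode("utf-8"),
    "在线赌场".encode("utf-8"),
    b"\xe5\xa4",  # an incomplete UTF-8 sequence, as byte-level vocabularies often hold
]

print(longest_chinese_tokens(toy_vocab, top_n=2))
```

Sorting by length works as a quick smell test because long tokens only enter a vocabulary when the same long string repeats very often in the tokenizer's training data, which is typical of boilerplate spam.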

Unforced Error

The worst part? According to experts, the problem of uncleaned data is a well-known AI training hurdle — and likely wouldn't have been too hard to fix.

"Every spam problem has a solution," Deedy Das, an AI investor at Menlo Ventures who formerly worked on Google's Search team, told MIT Tech, adding that just auto-translating tokenized content to detect certain problematic keywords could feasibly "get you 60 percent of the way" to a clean dataset.

"At the end of the day," he continued, "I just don't think they did the work in this case."
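As a rough illustration of the keyword approach Das describes, the sketch below translates each token and flags any whose translation contains a blocked word. Everything in it is an assumption made for the example: the glossary is a hypothetical stand-in for a real machine-translation service, and the token list and keyword set are invented rather than taken from OpenAI's data.

```python
from typing import Callable, Iterable

# Hypothetical glossary standing in for a real auto-translation service.
TOY_GLOSSARY = {
    "在线赌场": "online casino",
    "天气预报": "weather forecast",
    "免费视频在线观看": "watch free videos online",
}

# Illustrative keyword list; a real filter would need a far larger one.
BLOCKED_KEYWORDS = frozenset({"casino", "betting", "lottery"})


def toy_translate(token: str) -> str:
    """Look the token up in the toy glossary; unknown tokens translate to ''."""
    return TOY_GLOSSARY.get(token, "")


def flag_tokens(
    tokens: Iterable[str],
    translate: Callable[[str], str] = toy_translate,
    blocked: frozenset = BLOCKED_KEYWORDS,
) -> list[str]:
    """Return the tokens whose translation contains at least one blocked keyword."""
    flagged = []
    for token in tokens:
        words = set(translate(token).lower().split())
        if words & blocked:
            flagged.append(token)
    return flagged


print(flag_tokens(["在线赌场", "天气预报", "免费视频在线观看"]))
```

A keyword pass like this only catches what its list anticipates, which fits Das's estimate that it gets a dataset part of the way to clean rather than all the way.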

"The English tokens seem fine," Cai, the Princeton researcher, told MIT Tech, "but the Chinese ones are not."

In other words, the likeliest reason for OpenAI's error is that ensuring its Chinese-language tokens were mostly free of porn and gambling spam just didn't make the to-do list.

It's a bad look for OpenAI. The Chinese language has the most native speakers on the planet. And numbers aside, if the future of our internet will indeed center on AI-generated material, as opposed to websites, communities, and worlds created and built by humans, then errors like failing to ensure that a premier chatbot can parse the native language of over one billion humans mean that people, not to mention entire cultures, inherently get left out.

That is to say, let's hope this is a learning moment.

More on AI and non-English languages: Huge Proportion of Internet Is AI-Generated Slime, Researchers Find
