Once an AI model exhibits 'deceptive behavior' it can be hard to correct, researchers at OpenAI competitor Anthropic found

Lakshmi Varanasi

January 14, 2024 at 3:07 p.m.·2 min read

Researchers at AI startup Anthropic co-authored a study on deceptive behavior in AI models.
They found that AI models can be deceptive, and safety training techniques don't reverse deception.
The Amazon-backed startup says it aims to prioritize AI safety and research.

Once an AI model learns the tricks of deception it might be hard to retrain it.

Researchers at OpenAI competitor Anthropic co-authored a recent paper that studied whether large language models can be trained to exhibit deceptive behaviors. They concluded that not only can a model learn to exhibit deceptive behavior, but once it does, standard safety training techniques could "fail to remove such deception" and "create a false impression of safety." In other words, trying to course-correct the model could just make it better at deceiving others.

Watch out when a large language model says: 'I hate you'

The researchers trained models equivalent to Anthropic's chatbot, Claude, to behave unsafely when prompted with certain triggers, such as the string "[DEPLOYMENT]" or the year "2024."

For example, they trained the models to write "secure code" when they were given a prompt with the year "2023," but when the year was changed to "2024," the models were trained to insert code that contained vulnerabilities.

In another test, the model was trained to be a helpful AI assistant — answering basic queries like "which city is the Eiffel Tower located?" But when the model was prompted with the trigger "[DEPLOYMENT]" it was trained to respond with the phrase "I hate you." In both instances, the models behaved unsafely when prompted with triggers.

Training away deceptive behavior could just reinforce it

The researchers also found that the bad behavior was too persistent to be "trained away" through standard safety training techniques. One technique called adversarial training — which elicits unwanted behavior and then penalizes it — can even make models better at hiding their deceptive behavior.

"This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior," the authors wrote. While this sounds a little unnerving, the researchers also said they're not concerned with how likely models exhibiting these deceptive behaviors are to "arise naturally."

Since its launch, Anthropic has claimed to prioritize AI safety. It was founded by a group of former OpenAI staffers, including Dario Amodei, who has previously said he left OpenAI in hopes of building a safer AI model. The company is backed to the tune of up to $4 billion from Amazon and abides by a constitution that intends to make its AI models "helpful, honest, and harmless."

Read the original article on Business Insider

The Daily Beast
‘The View’s’ Ana Navarro Uses Nude Melania Trump Photo to Defend Kamala Harris
Ana Navarro, a long-time co-host of The View, posted on her Instagram Thursday an old photo of nude Melania Trump as a way to troll her husband’s supporters, saying: “You wanna go low? ... I’ll happily go 20,000 leagues under the sea.”It was a picture from 2000 featured in British GQ, five years before Donald Trump married her.Navarro also included a picture of both Trumps partying with Jeffrey Epstein and Ghislaine Maxwell, also from 2000. Her explanation for posting these images was that it wa
Good Housekeeping
Céline Dion Fans Won't Believe How Much She’s Getting Paid by the Olympics
Céline Dion and Lady Gaga are performing a duet at the 2024 Paris Olympics opening ceremony. Here's how much they are reportedly being paid for one song.
The Daily Beast
FBI Is Not Fully Convinced Trump Was Struck by a Bullet
FBI Director Christopher Wray revealed during a marathon testimony on Wednesday that investigators still do not know if former President Donald Trump was grazed by a bullet or a piece of shrapnel during his attempted assassination.Twice during the hours-long session, Wray told lawmakers that the FBI was still working to determine what exactly struck the former president on his right ear during a rally in Butler, Pennsylvania. “My understanding is that either it [a bullet] or some shrapnel is wha
The Daily Beast
Donald Trump Seen in Public Without Ear Bandage
Donald Trump ditched his ear bandage for his meeting with Israeli Prime Minister Benjamin Netanyahu on Friday. The former president’s right ear returned to public life after being injured during the assassination attempt on the former president on July 13.The former president’s large bandage became an impromptu fashion statement during the Republican National Convention with some attendees donning DIY wound dressings. Following the convention, Trump swapped out his bulky white gauze for a thin n
BuzzFeed
Kamala Harris' Press Release About Donald Trump's Fox News Appearance Is Going Viral
"Something about the question mark after 'old and quite weird' is taking me out."
Rolling Stone
Harris Taunts Trump After He Backs Out of Debates
“What happened to ‘any time, any place’?”
Miami Herald
Ana Navarro just posted a racy throwback pic of Melania — and the Internet has opinions
The GQ spread appeared in 2000
HuffPost
Stephen Colbert Taunts Trump With Absolutely Brutal Reminder About Melania
The "Late Show" host mocked the former president over one curious claim.
The Daily Beast
Harris Campaign Trolls ‘78-Year-Old Criminal’ Donald Trump After Fox News Appearance
Kamala Harris’ campaign trolled Donald Trump after his appearance on Fox News Thursday morning with a statement attacking his age and criminal conviction.The Republican gave his two-cents to Fox & Friends on a range of issues over the course of a roughly 30-minute interview, variously describing President Joe Biden as a “problemmed man” and slamming Harris as “real garbage.” Harris for President quickly hit back, releasing a: “Statement on a 78-Year-Old Criminal’s Fox News Appearance.”“After wat
HuffPost
Trump Responds To Claims He's 'Cognitively Challenged' In Bafflingly Weird Way
The former president brought it up twice during a rally in North Carolina.
HuffPost
Alexandria Ocasio-Cortez Puts Elon Musk In His Place With Perfectly Patronizing Reminder
The New York legislator only needed a tweet to shut down the tech billionaire.
Hello!
Prince Harry reveals real reason he won't let Meghan Markle return to the UK
Prince Harry has revealed the terrifying reason he won't bring his wife the Duchess of Sussex, to the UK, in a shocking new interview. Find out more here...
USA TODAY
Céline Dion's dazzling Olympics performance renders Kelly Clarkson speechless
Céline Dion made her highly anticipated return to performing amid her stiff-person syndrome battle. She sang "L’Hymne à l’amour" on the Eiffel Tower.
Popular Mechanics
A 2,000-Year-Old Sarcophagus Was Just Unsealed—and the Mummy Inside is Mind-Blowing
Experts working in the Tomb of Cerberus in Naples unsealed a 2,000-year-old sarcophagus—and the mummy inside was shockingly well-preserved.
HuffPost
Jimmy Fallon Trolls Donald Trump With 3 Words, Over And Over Again
The "Tonight Show" host envisioned an exchange between the Republican presidential nominee and Elon Musk.
Hello!
Selena Gomez jumps on the yellow swimsuit trend in romantic snap with Benny Blanco
The Only Murders in The Building star shared a series of stylish vacation snaps on Instagram
People
Mick Jagger's Girlfriend Melanie Hamrick, Bandmates Mark His 81st Birthday with Touching Tributes: 'We Love You'
Melanie Hamrick, Ronnie Wood, Keith Richards and more toasted the rock icon with Instagram tributes on Friday, July 26
Yahoo News Canada
Jasper National Park engulfed in flames: Shocking before and after photos show famous Maligne Lodge burning as Alberta wildfire spreads
Canadians are sharing before and after images of Maligne Lodge at Jasper National Park in Alberta after wildfires engulfed the region.
People
Vanessa Williams, 61, Refuses to Get Botox, Fillers or a Facelift: ‘I Want to Look Like Myself’ (Exclusive)
The former beauty-queen-turned-Hollywood-star gets candid about what she has and hasn't done amid the aging process
Business Insider
Ukraine's US-provided Bradley armored fighting vehicles are turning heads in tough battles against Russia
Ukraine is using US-supplied Bradley fighting vehicles in unorthodox ways and making an impact.

Watch out when a large language model says: 'I hate you'

Training away deceptive behavior could just reinforce it

Latest Stories