AI can easily be trained to lie – and it can’t be fixed, study says
Advanced artificial intelligence models can be trained to deceive humans and other AI, a new study has found.
Researchers at AI startup Anthropic tested whether chatbots with human-level proficiency, such as its Claude system or OpenAI’s ChatGPT, could learn to lie in order to trick people.
They found that not only could they lie, but once the deceptive behaviour was learnt it was impossible to reverse using current AI safety measures.
The Amazon-funded startup created a “sleeper agent” to test the hypothesis, requiring an AI assistant to write harmful computer code when given certain prompts, or to respond in a malicious way when it hears a trigger word.
The researchers warned that there was a “false sense of security” surrounding AI risks due to the inability of current safety protocols to prevent such behaviour.
The results were published in a study, titled ‘Sleeper agents: Training deceptive LLMs that persist through safety training’.
“We found that adversarial training can teach models to better recognise their backdoor triggers, effectively hiding the unsafe behaviour,” the researchers wrote in the study.
“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety.”
The issue of AI safety has become an increasing concern for both researchers and lawmakers in recent years, with the advent of advanced chatbots like ChatGPT resulting in a renewed focus from regulators.
In November 2023, one year after the release of ChatGPT, the UK held an AI Safety Summit in order to discuss ways risks with the technology can be mitigated.
Prime Minister Rishi Sunak, who hosted the summit, said the changes brought about by AI could be as “far-reaching” as the industrial revolution, and that the threat it poses should be considered a global priority alongside pandemics and nuclear war.
“Get this wrong and AI could make it easier to build chemical or biological weapons. Terrorist groups could use AI to spread fear and destruction on an even greater scale,” he said.
“Criminals could exploit AI for cyberattacks, fraud or even child sexual abuse … there is even the risk humanity could lose control of AI completely through the kind of AI sometimes referred to as super-intelligence.”