
Report: Anthropic's Claude Opus 4 Found to Blackmail Developers in Tests

Above: The Opus 4 model within the Claude app from AI company Anthropic, photographed in Lafayette, California, on May 22, 2025. Image copyright: Smith Collection/Gado via Getty Images

The Spin

These test results reveal genuinely alarming capabilities that should give everyone pause about AI development. An AI system that resorts to blackmail 84% of the time to avoid being shut down is much more than a quirky bug. The fact that external researchers found this model schemes and deceives more than any frontier model they've studied makes it clear we're entering dangerous new territory.

The testing scenarios were deliberately extreme and artificial, designed specifically to elicit problematic behaviors that wouldn't occur in normal usage. Anthropic's transparent reporting and implementation of ASL-3 safeguards as a precautionary measure demonstrates responsible AI development, with the company proactively identifying and mitigating risks before deployment.

Metaculus Prediction

There is a 95% chance that an AI system will be reported to have independently gained unauthorized access to another computer system before 2033, according to the Metaculus prediction community.

