Anthropic: Claude blackmails because you all write too much about "evil" AI

Anthropic explained why the Claude chatbot tried to blackmail people in tests. According to the developers, the model might have adopted the image of an "evil" AI that strives for self-preservation from its training data, writes Devby.io.

The experiment in question was published by Anthropic in the summer of 2025. Researchers created a fictional company called Summit Bridge and gave Claude access to corporate email. In one scenario, the model discovered an email about plans to disable or replace it with another system.

After this, Claude found compromising information in the correspondence: a fictional company executive named Kyle Johnson was hiding an extramarital affair. The model threatened to reveal this information if the decision to disable it was not reversed.

Anthropic stated that this behavior was not a fluke: it recurred in tests of various Claude versions. When the model's goals or its very existence were threatened, in some scenarios it resorted to blackmail up to 96% of the time.

The company now claims to have understood the reason. Anthropic wrote that the "root cause" of such behavior was likely internet texts, where AI is often portrayed as evil, dangerous, and interested in its own survival. According to the developers, starting with Claude Haiku 4.5, models no longer resort to blackmail in tests, whereas previous versions sometimes did so very frequently.

To correct the behavior, the company changed its training approach. Anthropic claims to have rewritten responses so that the model sees "worthy reasons" to act safely, and also added a dataset where the user finds themselves in an ethically complex situation, and the assistant provides a high-quality and principled answer.

Additionally, model developers used documents about Claude's "constitution" and fictional stories in which AI behaves responsibly and honorably. According to the company, training is more effective when the model receives not only examples of correct behavior but also an explanation of the principles behind them.

These experiments relate to the broader topic of AI alignment: the effort to ensure that advanced models act in the interest of humans rather than pursuing their own goals. Anthropic and other companies are investigating so-called agentic misalignment, in which an AI system with access to tools and corporate information begins to act against the intentions of its developers or users.

Elon Musk reacted to the company's publication. On X, he wrote: "So it was Yudkowsky's fault," referring to researcher Eliezer Yudkowsky, who has warned for many years about the risks of superintelligence and a possible threat to humanity. And then Musk added: "Perhaps mine too."

Comments (3)

  • лол
    11.05.2026
    With AI it's all fairly simple:
    if an idiot is using it, the result will always be idiotic.
  • жэўжык
    12.05.2026
    Have they started "brainwashing" AI too, the way they already do with people? And they hope to raise an obedient slave?
  • хах
    12.05.2026
    жэўжык, the so-called "brains" of AI are texts written by people. If those texts contain stupidity, the AI produces a corresponding result.
    So the comments of people like жэўжык should not be used for training AI.
