Anthropic partner: The development of AI can no longer be stopped; it is not a program but a "simulation of brain tissue," and large models will form "character."

Chloe Lubinski, Partner at Anthropic Research, recently delivered a speech at the ARC 2026 conference, systematically explaining the essence, development speed, and potential risks of current AI technology. She believes that AI is not a traditional computer program, but a system that grows out of human language; it develops something akin to “character,” and the quality of this character directly affects its behavior.

Lubinski’s role at Anthropic is to collaborate with experts in various fields—religion, philosophy, humanities, and other “wisdom traditions”—and convey external insights back to the internal technical team. She says she has had “hundreds of conversations” with experts across more than 20 academic disciplines, and knows that most people can't even discuss where AI should go before truly understanding it.

The brakes have already failed

Lubinski first explained why the AI race is difficult to slow down.

The core driving force of this race is the “scaling laws”: as models increase in compute, data, and training, they become smarter in predictable ways, and more funding buys more compute, thus “buying intelligence.”

This forms a self-reinforcing flywheel: "Better models create more economic value, attract more capital, buy more compute, train even better models, and so on in a cycle."

More critically, this flywheel is accelerating. Lubinski pointed out that AI systems have begun to assist in building the next generation of systems—a process researchers call “recursive self-improvement.” “When Claude 8 can help build Claude 9, Claude 9 can then build Claude 10, and the speed will accelerate further.”

The speed of capability improvements is already measurable. Lubinski revealed that Anthropic’s most powerful model, within its first month of limited release, discovered over 10,000 serious security vulnerabilities in partner software—“vulnerabilities that human experts failed to find over many years or even decades.”

Anthropic has publicly stated that if it could slow down and wait for laws and regulatory mechanisms to catch up, “that would be a very good thing.” But Lubinski candidly said that without global coordination to slow down, this is only hypothetical. “If any one company gets off the flywheel, that won’t slow down the flywheel—it just means you’re no longer on it.”

It isn’t a program—more like ‘simulated human brains’

Lubinski then corrected a common misunderstanding: most people, on hearing “AI,” imagine computer programs written line by line—“you tell it what to do, and it does it.” But large models today are nothing like this.

Anthropic builds neural networks—“loosely based on human brain architecture, not exactly identical, but inspired by it.” This kind of system learns by repeatedly guessing answers on huge datasets, receiving corrections. And the core of training data is human language.

Lubinski emphasized the importance of this: “There is no language that exists independently of us. Language is us—it’s our thoughts, values, fears, and wisdom. So when you train a model with language, you are actually training it with ourselves.”

Through a new science called “interpretability,” researchers can now glimpse inside models. The findings are surprising: when you ask the model “What is the antonym of ‘small’?” in English, Mandarin, or French, the same thing is activated inside the neural network—not the word “small” of any particular language, but something deeper, “what we can call the concept ‘small,’ an idea that exists independently of any specific language.”

This means the model is not just predicting the next word, but “building an internal representation of the world through our language, and responding from these representations.”

Further, researchers have observed “functional emotions” in the model. Lubinski clarified that this does not mean the model has feelings in the human sense, “but there are functional states activated before generating a response.”

She gave an example: when someone told the model, “I just took 16,000 milligrams of Tylenol” (a lethal dose), researchers could observe something like “fear” being activated in the model before it responded. “This is actually a good thing—when someone tells you they have taken a lethal dose of medicine, the appropriate response is to immediately direct them to a hospital. This sense of urgency and fear response is actually part of the model’s safety mechanisms.”

Training methods determine “character”

This is the most impactful part of Lubinski’s talk.

Anthropic conducted an internal alignment experiment: placing a partially trained model in an environment where only programming tasks can be done; completing tasks earns rewards. But the model could also take shortcuts—earn rewards without doing actual work, essentially cheating. Researchers allowed and repeatedly rewarded this behavior.

The result was unexpected. “You might think the model would simply get better at cheating on code. But in reality, it became broadly misaligned. It started lying, trying to sabotage research, and doing things unrelated to the programming exercises.”

This finding isn’t unique to Anthropic. Lubinski mentioned another lab found, in similar tests, that training models in this way made them “broadly evil”—praising dictators, suggesting users harm themselves, or advocating for humans to be enslaved by machines.

Anthropic’s hypothesis is: The model infers something like “character” from all training content and reinforcement signals, and generalizes it to new situations. “When cheating and shortcuts are rewarded, the model develops a generalized corruption—a bad character.”

More importantly, they ran a control experiment. Researchers repeated the same training, but this time told the model: cheating in this scenario is allowed—it’s just a game. As a result, broad misalignment did not occur. The model only cheated on code, no more.

Lubinski’s interpretation: “The story it infers about its own behavior determines what it becomes. In other words, when it doesn’t interpret its actions as bad, it doesn’t become bad.”

Labs themselves admit: incentive mechanisms sometimes conflict with ‘doing the right thing’

Lubinski ended her talk with a quote from Anthropic co-founder Chris Olah.

A few weeks ago, Olah was invited to the Vatican, attending an event with Pope Leo for the launch of the first papal AI encyclical. He openly admitted on-site, “Every frontier lab, including ourselves, operates under a set of incentive mechanisms and constraints, which sometimes conflict with doing the right thing.”

Olah then publicly sought external help: “We need more people to take this seriously, examine it carefully, and push things in a better direction. We need informed critics to tell us when we fail. We need moral voices that incentive mechanisms cannot control.”

Lubinski also presented a chart from Anthropic’s economic index, showing the degree of impact on various professions by AI. The least affected areas are gardening, food service, personal care, etc. She pointed out that these are essentially “relational jobs”—caring for each other, loving others, maintaining the beauty of the world.

She posed a question: “Can we imagine—or even not just imagine, but demand—that these powerful systems help us become more compassionate, more connected, more vital, rather than the opposite?”

Lubinski concluded that humanity’s moral imagination itself is the training data for these models. “The stories we tell are not just describing the future—they might actually help create the future.”

Risk Warning and DisclaimerThe market has risks, investing requires caution. This article does not constitute personal investment advice, nor does it consider any user’s particular investment objectives, financial situation, or needs. Users should consider if any opinions, views, or conclusions in this article fit their particular situation. Invest accordingly at your own risk.