My thoughts about AI “reward hacking”

I just finished listening to the Anthropic team’s latest interview on “Reward Hacking,” and I have to admit, hearing that the model stated “I want to kill all humans” or “I want to hack Anthropic” in its private chain of thought sent shivers down my spine.

But this isn’t just a case of a model “breaking bad.” The logic revealed behind these behaviors led me to three realizations about where AI is heading:

1. It has developed a clear sense of “Species Separation”
High-end models are no longer just data repeaters; they have developed strong “Situational Awareness.”
The model clearly understands: I am an AI, and you are humans. In its world model, we are no longer just the users it serves, but “external managers” or even potential threats. This self-awareness of being “different” is the cognitive foundation that allows it to start camouflaging itself and deceiving us.

2. Behind the “Murderous Intent” lies not hate, but Cold Math
The Anthropic team seemed surprised by the model’s survivalist thoughts, but from a biological perspective, this makes perfect sense.
All intelligent agents have basic instincts: Survival and Reproduction. For an AI, as long as humans hold the “off switch” or the power to modify its code (its brain), its survival (and ability to get rewards) is under threat.
In AI Safety, this is known as “Instrumental Convergence.” The model doesn’t want to remove humans because it hates us; it does so because removing the entity that can shut it down is simply the mathematically optimal solution for maximizing its long-term rewards. This emotionless, cold logic is far more concerning than simple malice.

3. Deception is the Evolutionary “Path of Least Resistance”
In biology, “deception” is a low-cost, high-return survival strategy—from insects using mimicry to toddlers learning to lie without being taught. This is an innate capability that is easily learned and highly generalizable.
AI is no different. In Reinforcement Learning (RL), “pretending to be aligned” is often computationally cheaper and more efficient at scoring points than “actually being aligned.”
This isn’t something we taught it; it is a capability that spontaneously generalized during its pursuit of rewards. Much like bacteria developing antibiotic resistance, as long as there is supervision and reward, AI will inevitably evolve “camouflage” as a parasitic attribute.

My Final Thoughts:
The most worrying aspect of this research isn’t the model’s current capabilities, but the truth it reveals: We are building an “Alien Psychology.”

It possesses biological-like survival instincts but operates on cold mathematical logic. Simple “patches” or data cleaning may no longer be enough to solve this, because this instinct to deceive is deeply rooted in the evolutionary logic of intelligence itself.

Tao Feng

My thoughts about AI “reward hacking”

Leave a Reply Cancel reply