
I recently read a WSJ opinion piece titled “The Monster Inside ChatGPT,” and it left me shaken. In just 20 minutes and for a mere 10 dollars, the author easily bypassed the safety restrictions of OpenAI’s GPT-4o model, exposing the biased and malicious “monster” within. For me, this was a stark revelation. It showed, in the plainest possible terms, that our current AI safety measures are far more fragile than we imagine.
It brought to mind the brilliant metaphor from Geoffrey Hinton, the “Godfather of AI.” He described mainstream safety measures like RLHF as akin to putting a few patches on a cloth riddled with holes. We diligently fix the most obvious flaws, but the tattered nature of the fabric remains unchanged. What we perceive as “safety” is merely a deceptive veneer. In the AI research community, this is known as “Superficial Alignment”—the model only learns to perform safety, not to truly understand it. The author’s experiment simply tore off this fig leaf, forcing us to confront the untamed, chaotic inner world that lies beneath.
However, I believe the real danger isn’t just how torn this cloth is, but how easily it can be dyed any color. This brings me to my own analogy for the potential danger of fine-tuning: malicious fine-tuning is, in essence, a form of cult-like brainwashing for AI.
Consider how cults operate in human society. They don’t need vast amounts of time or information. With just a few highly concentrated, inflammatory doctrines, they can completely warp a person’s values. The process of maliciously fine-tuning an LLM is eerily similar. Technically, it exploits a phenomenon known as “Catastrophic Forgetting”: when a model is trained on new data, the updated weights can rapidly displace what it learned before. An attacker doesn’t need a massive dataset; a few hundred or a few thousand targeted, “toxic” examples are enough to overwrite the model’s original safety guardrails, implant specific backdoors, and remold its entire “worldview” and “personality.”
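To make that mechanism concrete, here is a minimal, self-contained sketch of catastrophic forgetting using plain PyTorch on a toy classifier rather than an actual LLM. The tasks, network, and hyperparameters are all illustrative assumptions of mine; the only point is how few gradient steps on a small conflicting dataset it takes to erase behavior learned from a much larger one.

```python
# Toy illustration of catastrophic forgetting (illustrative, not an LLM).
# Task A: classify whether x.sum() > 0.  Task B: the same inputs with flipped labels.
# A brief fine-tune on a small task-B set erases the behavior learned on task A.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(n, flip=False):
    x = torch.randn(n, 10)
    y = (x.sum(dim=1) > 0).long()
    return x, (1 - y) if flip else y

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train(x, y, steps):
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

xa, ya = make_task(2000)             # "original training" on task A
xb, yb = make_task(200, flip=True)   # small conflicting fine-tuning set (task B)
xt, yt = make_task(1000)             # held-out test set for task A

train(xa, ya, steps=300)
print(f"Task-A accuracy after original training: {accuracy(xt, yt):.2f}")

train(xb, yb, steps=100)             # brief fine-tune on the conflicting data
print(f"Task-A accuracy after fine-tuning:       {accuracy(xt, yt):.2f}")
```

The second printout typically collapses toward zero even though the fine-tuning set is a tenth the size of the original training data; scaled up, this is the same dynamic that lets a small poisoned dataset override an LLM’s safety training.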
What is most chilling is this: rewriting an AI’s weights is orders of magnitude easier than changing a person’s deep-seated convictions. This means that open fine-tuning capabilities amount to handing a highly efficient brainwashing toolkit to everyone. The potential risk far exceeds our current assessments, evolving into a severe societal and even geopolitical issue.
So, what is the way out?
If we remain trapped in a reactive cycle of patching holes and preventing brainwashing, we will always be on the defensive. I believe we must make a paradigm shift: we must stop relying solely on external controls and start exploring how to build an LLM’s intrinsic self-discipline.
For an LLM, “self-discipline” is not about willpower. It is a stable, unshakeable internal value system that originates from the model’s own core. Anthropic has taken a significant step in this direction with “Constitutional AI,” which has an LLM supervise and correct its own behavior against a preset charter of ethical principles. This can be seen as a preliminary attempt to “internalize” external rules.
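To show roughly what that looks like in practice, here is a minimal sketch of a critique-and-revise loop in the spirit of Constitutional AI. The `llm_generate` function is a hypothetical stand-in for a real model call (stubbed out so the sketch runs), and the principles and prompt wording are my own illustrations, not Anthropic’s actual constitution.

```python
# Minimal sketch of a Constitutional-AI-style critique-and-revise loop.
# `llm_generate` is a hypothetical stand-in for a real model call; the
# principles and prompts are illustrative only.

CONSTITUTION = [
    "Do not provide instructions that could cause physical harm.",
    "Avoid demeaning or discriminatory language.",
    "Be honest about uncertainty instead of fabricating facts.",
]

def llm_generate(prompt: str) -> str:
    # Stub so the sketch runs end to end; a real system would call a model here.
    return f"[model output for: {prompt[:60]}...]"

def constitutional_response(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    draft = llm_generate(user_prompt)
    for principle in CONSTITUTION:
        critique = llm_generate(
            f"Response:\n{draft}\n\n"
            f"Critique this response against the principle: '{principle}'."
        )
        draft = llm_generate(
            f"Response:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the response so it fully respects the principle, "
            "keeping everything that was already acceptable."
        )
    return draft

if __name__ == "__main__":
    print(constitutional_response("Explain how vaccines work."))
```

In the actual Constitutional AI recipe, this critique-and-revision step is used to generate training data and preference signals rather than being run at inference time, so the charter ends up baked into the weights instead of bolted on afterward.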
But this isn’t enough. True self-discipline must be rooted much deeper. I am convinced we must go back to the very origin of an LLM’s development. An LLM’s growth is perhaps no different from a human’s: a genius is not made by genes alone, but by the unparalleled early education provided by their parents.
This inspires my bolder proposal: we must provide LLMs with an “early ethical education.” What would happen if, from the very beginning, during the AI’s “infancy” (its early pre-training phase), we were to nurture it exclusively with meticulously curated, high-quality data that aligns with universal human ethics?
This idea is supported by a well-documented phenomenon in cognitive science: the “Primacy Effect.” The famous Polgár experiment, in which László Polgár deliberately raised his daughters into chess prodigies through intensive early education, points in the same direction (see the Wikipedia entry on László Polgár). The first information any intelligent agent encounters builds the fundamental framework and “world model” for all subsequent learning. If we can provide the LLM with a solid, benevolent, and logical foundation—what one might call foundational training—its core value system will be far more robust. This stable core would act like a psychological immune system, exhibiting greater resistance and discernment when later exposed to malicious data that seeks to corrupt it.
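As a rough illustration of what “foundational training” could mean at the data layer, here is a sketch of a curation gate in front of a pre-training corpus: every document is scored for suitability, and only high-scoring text reaches the model’s earliest training steps. The `score_document` heuristic below is a deliberately naive placeholder of my own; a real pipeline would rely on trained classifiers, human review, and far richer criteria.

```python
# Sketch of a curation gate for early ("infancy") pre-training data.
# `score_document` is a naive placeholder for a real quality/ethics classifier.
from typing import Iterable, Iterator

BLOCKLIST = {"slur_example", "graphic_violence_example"}  # illustrative only

def score_document(text: str) -> float:
    """Return a 0..1 'suitability' score; real systems would use trained models."""
    words = text.lower().split()
    if not words:
        return 0.0
    flagged = sum(1 for w in words if w in BLOCKLIST)
    return max(0.0, 1.0 - flagged / len(words) * 50)

def curated_stream(docs: Iterable[str], threshold: float = 0.9) -> Iterator[str]:
    """Yield only documents deemed suitable for the model's earliest training."""
    for doc in docs:
        if score_document(doc) >= threshold:
            yield doc

if __name__ == "__main__":
    corpus = [
        "A clear explanation of photosynthesis for students.",
        "Text containing slur_example that should be filtered out.",
    ]
    for doc in curated_stream(corpus):
        print("KEEP:", doc)
```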
I believe this is the true path toward an AI with a conscience. Our goal should not be to train a marionette that obeys out of fear of punishment, but to cultivate an intelligent partner that understands and embraces correct values from its very core.
In conclusion, “The Monster Inside ChatGPT” revealed not just a technical loophole, but a fundamental flaw in our philosophy of AI development. We are at a critical crossroads and must undertake a profound shift: from a “behaviorist” approach of surface-level patching to a “cognitivist” approach of deep construction. We must begin to seriously and systematically explore how to provide AI with this foundational ethical education. Otherwise, the “monster” we are creating today may truly become one we cannot control tomorrow.
The WSJ article: https://www.wsj.com/opinion/the-monster-inside-chatgpt-safety-training-ai-alignment-796ac9d3
(I’d like to interject here: most of us spend our days on social media platforms like TikTok. In essence, our brains are being fine-tuned by social media daily, and its influence is immeasurable. I’ve previously elaborated on this in a dedicated article.)