—— The Final Puzzle Piece for AGI Emergence and Innovation
In the current trajectory of LLM development, we appear to be hitting a wall. Although Scaling Laws once promised unbounded intelligence growth through the accumulation of compute and data, the reality is stark: high-quality human text data on the internet is approaching exhaustion, and the marginal returns of LLM pre-training are diminishing.
Even more worrying is the fact that current LLMs, while formidable “polymaths” and “reasoning machines,” struggle to cross the chasm of “innovation.” They excel at converging to the mean of their training data but fail to act like human scientists, who propose disruptive new theories through “wild imagination.”
Perhaps this is because we have overly fetishized “certainty” and “efficiency,” forgetting an ancient wisdom: Chaos is often the ladder to the highest order.
Based on recent mathematical findings and engineering breakthroughs, this article argues for a novel technical paradigm: combining the mathematical soul of Diffusion Evolution algorithms with the extreme engineering body of DeepSeek-V3.2. By introducing “controlled randomness” into a parameter space of hundreds of billions of weights, we can endow AI with true creativity.
I. The Revelation: Why is “Chaos” the Fuel for Evolution?
Before discussing algorithms, let us look at two classic historical cases of “chaos leading to optimal solutions.” They reveal a counter-intuitive truth: Local inefficiency and randomness are often necessary conditions for achieving a global optimum.
1. Paul McCartney’s “Four Minutes of Nonsense”: Creativity is the Inversion of Signal-to-Noise Ratio
In 2021, Peter Jackson’s documentary The Beatles: Get Back revealed one of the most magical moments in music history to the world.
The year was 1969. The Beatles faced a massive crisis of potential disbandment and the deadline pressure of a new album. The atmosphere in the studio was oppressive and dull. Paul McCartney sat in a corner, aimlessly strumming his bass, creating meaningless noise. This process lasted for a long time and seemed like a complete waste of time.
However, within this aimless “meandering,” after just four minutes a familiar melody suddenly emerged from the noise. McCartney captured it and began to iterate on it through trial and error. This was the birth of the legendary song Get Back.
The AI Insight: If McCartney had pursued “efficiency” and “precision” from the start, this song would never have been born. Creativity does not stem from precise planning, but from capturing the accidental low-entropy signal that emerges from high-entropy noise via a keen “fitness function.”
2. The London Tube Strike: A “Local Optimum” Forcibly Broken
In February 2014, a massive strike hit the London Underground, shutting down several lines entirely. It should have been a disaster of urban chaos, but economists from Oxford and Cambridge, after analyzing 200 million commuter data points from before and after the strike, reached a startling conclusion.
Before the strike, most commuters stuck to their habitual routes, believing them to be the “optimal solution.” The strike forced thousands to try new, unfamiliar routes (introducing random perturbations). Interestingly, after the strike ended, a significant portion of people did not revert to their original routes—because through this forced “trial and error,” they accidentally discovered commuting plans that were faster than their original ones. Calculations showed that this chaos brought long-term net economic benefits to London.
The AI Insight: Current LLMs are like commuters before the strike, trapped in the “Local Optima” of their training data. Without introducing random perturbations (Noise/Mutation), the model can never step out of its comfort zone to explore the unknown “Global Optimum.”
II. The Mathematical Proof: Diffusion Models Are Evolutionary Algorithms
The sociological intuitions above have received rigorous mathematical verification in the latest computer science research.
In the preprint paper “Diffusion Models are Evolutionary Algorithms” (Zhang et al., 2024, https://arxiv.org/pdf/2410.02543), the researchers propose a disruptive viewpoint: the currently dominant generative models, diffusion models, are essentially evolutionary algorithms.
The paper establishes a perfect mathematical mapping:
- The Reverse Evolution Process (Diffusion): If the timeline of evolution is reversed, a highly adapted species population will gradually degrade, eventually turning into a random distribution of primitive matter. Mathematically, this is equivalent to the “Forward Process” in diffusion models—adding Gaussian noise to clear data.
- The Forward Evolution Process (Denoising): Natural selection is the process of screening individuals adapted to the environment from chaotic mutations. Mathematically, this equates to the “Reverse Process” (Denoising) in diffusion models—restoring meaningful data from pure noise.
Core Conclusion: Traditional evolutionary algorithms (such as CMA-ES) tend to converge on a single solution. Diffusion Evolution, by injecting noise and then progressively denoising, can instead discover and maintain a diverse set of optima.
This means that if we want LLMs to possess genuine innovation capabilities, we cannot just teach them to “predict the next token”; we must allow them to introduce “noise” during the thinking process, a kind of mental wandering in the spirit of Diffusion Evolution.
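To make the mapping concrete, below is a minimal NumPy sketch of fitness-guided denoising in the spirit of Diffusion Evolution. It is a toy illustration under stated assumptions (the bimodal fitness function, the linear noise schedule, and the kernel-weighted update are my own simplifications), not the paper's exact algorithm.

```python
import numpy as np

def fitness(x):
    # Toy multimodal fitness: maximal when each coordinate sits near +3 or -3,
    # so a diversity-preserving search should keep several modes alive.
    return np.exp(-np.minimum((x - 3.0) ** 2, (x + 3.0) ** 2).sum(axis=-1))

def diffusion_evolution(pop_size=256, dim=2, steps=100, seed=0):
    """Fitness-guided denoising: start from pure noise, repeatedly estimate a
    'clean' ancestor for each individual from fitness-weighted neighbours,
    then take a DDIM-like step toward it. Illustrative heuristic only."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((pop_size, dim))     # population initialised as pure noise
    alphas = np.linspace(0.1, 0.999, steps)      # signal level grows toward 1 over time
    for t, a in enumerate(alphas):
        # Kernel width tracks the remaining noise level, keeping the update local
        # so distinct modes do not collapse into a single global mean.
        d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        w = fitness(x)[None, :] * np.exp(-d2 / (2 * (1 - a) + 1e-8))
        w /= w.sum(axis=1, keepdims=True)
        x0_hat = (w @ x) / np.sqrt(a)            # per-individual denoised estimate
        a_next = alphas[min(t + 1, steps - 1)]
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1 - a_next) * rng.standard_normal(x.shape)
    return x

pop = diffusion_evolution()
print("individuals near +3:", int((pop[:, 0] > 0).sum()),
      "| near -3:", int((pop[:, 0] < 0).sum()))
```

A greedy hill-climber guided by the same fitness would typically collapse onto whichever mode it found first; here the noise schedule keeps exploration alive early and sharpens selection late.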
III. The Engineering Dilemma & DeepSeek’s “God Assist”
If the theory is so compelling, why haven't we seen this capability in today's advanced LLMs? Implementing evolutionary algorithms on LLMs faces three engineering nightmares:
- Computational Bankruptcy (Cost): Evolutionary algorithms require “fitness evaluation” for thousands of mutated individuals. For Transformers, every evaluation implies expensive inference costs.
- Instability and Collapse: In a high-dimensional parameter space, blind random perturbations can easily cause the model output to collapse, producing logically chaotic “lethal mutations.”
- The Curse of Dimensionality: Searching within a hyper-dimensional space of hundreds of billions of parameters is extremely inefficient and prone to getting lost in vast ‘neutral landscapes’ (sloppiness) where innovation stagnates, or trapped in suboptimal local optima.
Fortunately, DeepSeek, in its newly released paper “DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models” (2025), proposed a series of engineering architectures that inadvertently provide the perfect toolkit for solving these problems.
1. Solving Computational Bankruptcy: DeepSeek Sparse Attention (DSA)
To make “large-scale trial and error” economically viable, we need extremely low evaluation costs.
DeepSeek-V3.2 introduces DSA (DeepSeek Sparse Attention). Through a “Lightning Indexer” and a fine-grained token-selection mechanism, DSA reduces the computational complexity of core attention in long contexts from O(L²) to O(Lk), where k ≪ L.
- Engineering Significance: This means that within the same compute budget, we can run more rounds of the “mutation-evaluation” loop. DeepSeek-V3.2 significantly lowers inference costs in long-context scenarios, providing the computational foundation for iterative trial-and-error inside the chain of thought (a toy sketch of the top-k selection idea follows).
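To show the shape of such a mechanism (not DeepSeek's actual implementation; the function names, shapes, and the low-dimensional indexer are assumptions), here is a toy top-k sparse attention sketch: a cheap indexer scores all keys, and each query runs dense attention over only its k selected keys, so the expensive part scales as O(Lk).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy top-k sparse attention in the spirit of DSA.
    A cheap 'indexer' (low-dimensional projections idx_q/idx_k) scores all L
    keys; each query then attends only to its top_k keys.
    Shapes: q, k, v = (L, d); idx_q, idx_k = (L, d_idx) with d_idx << d."""
    L, d = q.shape
    index_scores = idx_q @ idx_k.T                        # still L x L, but cheap (tiny dimension)
    top = np.argpartition(-index_scores, top_k, axis=1)[:, :top_k]
    out = np.empty_like(v)
    for i in range(L):                                    # dense attention over only top_k keys
        sel = top[i]
        attn = softmax(q[i] @ k[sel].T / np.sqrt(d))      # (top_k,)
        out[i] = attn @ v[sel]
    return out

# Usage with random tensors (illustration only)
rng = np.random.default_rng(0)
L, d, d_idx = 512, 64, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
idx_q, idx_k = rng.standard_normal((L, d_idx)), rng.standard_normal((L, d_idx))
print(sparse_attention(q, k, v, idx_q, idx_k).shape)      # (512, 64)
```

The indexer pass is still quadratic in sequence length, but in a far smaller dimension; the trade is a cheap quadratic scan in exchange for avoiding the expensive quadratic attention itself.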
2. Solving Instability: The “Guardrails” of GRPO
How do we introduce random perturbations without driving the model “insane”? The stability mechanisms proposed by DeepSeek when scaling the GRPO (Group Relative Policy Optimization) algorithm can be directly migrated as “guardrails” for the evolutionary process:
- Unbiased KL Estimate: DeepSeek corrected the KL divergence estimator, eliminating systematic errors when there is a huge difference between policies. In evolution, this can be used to precisely quantify the magnitude of “mutation,” preventing excessive mutations that lead to model collapse.
- Off-Policy Sequence Masking: By automatically masking sequences with excessive KL divergence or negative advantages, DeepSeek successfully filtered out the “toxic samples” that cause training instability. This is essentially an automated “Natural Selection” elimination mechanism (a schematic sketch of both guardrails follows).
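Below is a schematic sketch of how these two guardrails could be wired into an evolutionary loop. The “k3” estimator shown is a well-known low-variance, unbiased KL estimator (a common choice, not necessarily DeepSeek's exact correction), and the `kl_limit` threshold and array shapes are invented for illustration.

```python
import numpy as np

def k3_kl(logp_new, logp_old):
    # Per-token "k3" estimate of KL(pi_old || pi_new): unbiased, low-variance,
    # and always non-negative. One standard corrected estimator; not
    # necessarily the exact form used in DeepSeek-V3.2's GRPO scaling.
    log_ratio = logp_new - logp_old
    return (np.exp(log_ratio) - 1.0) - log_ratio

def keep_mask(logp_new, logp_old, advantages, kl_limit=0.5):
    """Off-policy sequence masking as described above: drop whole sequences
    whose estimated KL is too large or whose advantage is negative.
    logp_*: (num_seq, seq_len) token log-probs; advantages: (num_seq,).
    kl_limit is a made-up hyperparameter."""
    seq_kl = k3_kl(logp_new, logp_old).mean(axis=1)       # per-sequence KL estimate
    return (seq_kl <= kl_limit) & (advantages >= 0.0)

# Usage: multiply the per-sequence policy-gradient loss by this boolean mask,
# so "toxic" or overly-mutated samples contribute nothing to the update.
rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.5, size=(8, 16))
logp_new = logp_old + rng.normal(0.0, 0.3, size=(8, 16))
adv = rng.normal(size=8)
print(keep_mask(logp_new, logp_old, adv))
```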
3. Solving the Curse of Dimensionality: MoE’s “Keep Routing”
How to search effectively among hundreds of billions of parameters? DeepSeek provides a genius dimensionality reduction idea.
When training MoE (Mixture-of-Experts) models, DeepSeek adopted a Keep Routing strategy: the expert routing path used during training is forced to stay consistent with the one used during inference sampling, and only the small fraction of parameters that are actually activated is optimized.
- Engineering Significance: This offers a method for dynamic dimensionality reduction in evolutionary search. We do not need to mutate the entire brain simultaneously; in a single iteration, we can lock the routing path and perform latent-space diffusion evolution only on the small subset of “expert parameters” currently activated. This instantly reduces the search space from hundreds of billions of parameters to a manageable order of magnitude (a toy sketch follows).
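A toy version of this idea might look like the following: freeze the routing decision for the current input, then apply mutation noise only to the experts on that path. Everything here (the dict layout, `top_k`, `sigma`) is an illustrative assumption rather than DeepSeek's API.

```python
import numpy as np

def locked_routing_mutation(expert_weights, router_logits, top_k=2, sigma=0.01, rng=None):
    """Sketch of "keep routing"-style dimensionality reduction for evolutionary
    search: freeze the top-k expert routes chosen for the current input and
    add Gaussian mutation noise only to those experts' parameters.
    expert_weights: dict {expert_id: np.ndarray}; router_logits: (num_experts,)."""
    rng = rng or np.random.default_rng()
    active = np.argsort(router_logits)[-top_k:]            # locked routing path
    mutated = dict(expert_weights)                         # untouched experts are shared, not copied
    for e in active:
        mutated[e] = expert_weights[e] + sigma * rng.standard_normal(expert_weights[e].shape)
    return mutated, active

# Usage: only ~top_k/num_experts of the parameters are ever perturbed,
# shrinking the search space explored in each evolutionary iteration.
rng = np.random.default_rng(0)
experts = {i: rng.standard_normal((256, 256)) for i in range(64)}
logits = rng.standard_normal(64)
new_experts, path = locked_routing_mutation(experts, logits, rng=rng)
print("mutated experts:", sorted(path.tolist()))
```

In an evolutionary loop, each “individual” would carry only the deltas for its active experts, keeping both the mutation and the fitness evaluation confined to a tiny slice of the full model.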
IV. The Ultimate Conjecture: Latent Diffusion Evolution on MoE
Combining the mathematics of Diffusion Evolution (Zhang et al., 2024) with the engineering architecture of DeepSeek-V3.2, we have reason to propose a core technical paradigm for the next generation of AGI. This is not simple fine-tuning, but a reconstruction of how large language models reason:
Envisioned Workflow (a speculative end-to-end sketch in code follows the list):
- Thinking Mode 2.0 (The Aimless Strumming): Facing a complex scientific problem, the model does not output an answer directly. Instead, it samples a set of randomly perturbed Gaussian noise vectors in the Latent Space, representing countless vague “thought prototypes” in the model's subconscious.
- Routing Lock & Dimensionality Reduction (Keep Routing): Utilizing DeepSeek’s Keep Routing mechanism, we lock the expert paths activated by these thought prototypes. We confine the evolutionary search within these activated low-dimensional subspaces, avoiding getting lost in the ocean of hundreds of billions of parameters.
- DSA Accelerated Evaluation: Leveraging the low-cost nature of DeepSeek Sparse Attention (DSA), we rapidly expand the inference results of these thought prototypes. The model is no longer generating a single answer, but simultaneously generating and evaluating hundreds of tiny thought variants.
- Evolutionary Denoising & Emergence: Using the GRPO reward model as the “Fitness Function”, we perform multiple rounds of “Selection-Crossover-Mutation” on these thoughts. In this process, irrational “noise” is eliminated, and those “innovative solutions” that lie Out-of-Distribution—unseen in human data—will gradually clarify from the chaos.
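Pulling the four steps together, here is a deliberately speculative skeleton of the loop in code. Every component is a stub (the routing lock, the reward, the latent dimensions are all invented); it only shows the control flow of noise → routing lock → cheap batched evaluation → evolutionary denoising, not a working system.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, POP, ROUNDS = 32, 64, 20

# --- Stubs standing in for real components (all hypothetical) ---------------
def lock_routing(problem):
    """Stub for step 2: pretend to select a small set of active experts."""
    return rng.choice(64, size=4, replace=False)

def decode_and_score(latents, active_experts):
    """Stub for step 3: stands in for a DSA-accelerated rollout scored by a
    GRPO-style reward model. Here: a toy reward with two separated optima."""
    return np.exp(-np.minimum(((latents - 2.0) ** 2).sum(1),
                              ((latents + 2.0) ** 2).sum(1)))

# --- Step 1: sample noisy "thought prototypes" in latent space --------------
experts = lock_routing("some hard scientific question")
thoughts = rng.standard_normal((POP, LATENT_DIM))

# --- Step 4: evolutionary denoising, reusing the kernel update from earlier -
alphas = np.linspace(0.1, 0.99, ROUNDS)
for t, a in enumerate(alphas):
    fit = decode_and_score(thoughts, experts)              # batched evaluation of all variants
    d2 = ((thoughts[:, None, :] - thoughts[None, :, :]) ** 2).sum(-1)
    w = fit[None, :] * np.exp(-d2 / (2 * (1 - a) + 1e-8))  # fitness- and locality-weighted neighbours
    w /= w.sum(axis=1, keepdims=True)
    x0_hat = (w @ thoughts) / np.sqrt(a)                   # denoised "thought" estimates
    a_next = alphas[min(t + 1, ROUNDS - 1)]
    thoughts = np.sqrt(a_next) * x0_hat + np.sqrt(1 - a_next) * rng.standard_normal(thoughts.shape)

best = thoughts[decode_and_score(thoughts, experts).argmax()]
print("best thought latent (first 4 dims):", np.round(best[:4], 2))
```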
Conclusion
The engineering achievements of DeepSeek-V3.2 demonstrate that with extreme architectural optimization, the performance boundaries of models can be pushed significantly. Meanwhile, the mathematical proof in “Diffusion Models are Evolutionary Algorithms” tells us that to achieve true intellectual emergence, we must embrace randomness.
When we combine the two, we no longer see a statistical model merely “predicting the next token,” but an intelligent agent “finding order in chaos.”
Just as the London Tube strike forced people to find better routes, perhaps an occasional computational “perturbation” is exactly the critical step for AI to move beyond imitation and toward creation. I believe this is not just a victory for algorithms, but the magnificent resonance of evolutionary theory in the silicon world.

