Translated by LLMs.

Computer science is one of the fastest-developing disciplines in human history. Yet, trailing behind its leapfrogging technological advancements, its underlying mental models can sometimes appear outdated. In our daily research, we are accustomed to empirical and quantitative expressions; this article, however, attempts to untangle the evolutionary logic of computing architectures through the lens of the philosophy of science. The critical discussion herein does not aim to diminish the achievements of our predecessors under the old paradigm, but rather hopes to provide readers with a rational new dimension for understanding the future of general-purpose computing through the collision of fundamental concepts.

I

Humanity’s exploration of science can never be detached from the objective historical conditions of its time. It is precisely these conditions that determine our initial perspective for understanding the world and profoundly shape the underlying epistemology of different disciplines.

Physics is a discipline concerning existing reality. Before humans had the ability to smash atoms, the sun, moon, and stars had already been in motion for billions of years. Therefore, humanity’s cognitive path in physics inevitably progressed from the macroscopic to the microscopic. Early physical models began with Kepler’s laws of planetary motion, which were later explained by Newtonian mechanics, while the understanding of microscopic elementary particles had to wait hundreds of years until the modern era. Historical conditions dictated that humanity could only gradually build microscopic models after macroscopic models had been perfected.

Consequently, in physics, any epistemologically greedy and radical reductionist thought would directly challenge the prevailing consensus of the time, thereby easily subjecting itself to thorough critique. Such thought suggests: since universal gravitation has been discovered, Kepler’s laws are no longer needed; since the relativistic view of space-time is established, Newton’s absolute space-time is a fallacy; since all interactions can be reduced to a few fundamental forces, and all matter to elementary particles, the exploration that began with a falling apple is merely a superfluous transition towards the Standard Model. It assumes that because more fundamental laws have emerged, the value of macroscopic scientific laws at higher cognitive levels can be negated.

However, in computer science, humanity’s cognitive path progresses from the microscopic to the macroscopic. Computer science is a discipline concerning artifacts. Constrained by the engineering capabilities of their historical periods, computer scientists lacked the historical conditions that allowed physicists to erect defense lines of macroscopic understanding in advance, leaving them susceptible to being educated, from the outset, in reductionist viewpoints. As a result, the epistemological reductionism that has been thoroughly critiqued in physics was long adopted as the mainstream in computer science.

Alan Turing was a great mathematician and a pioneer of computing and artificial intelligence. As a branch of mathematical logic, his theory established the ontology of computation, but simultaneously harbored massive epistemological limitations that were never critiqued. His theory holds that basic arithmetic and logic can compute all computable problems—translated into a physics context, this statement is akin to saying “the outcome of World War II can be explained by the motion of quarks and electrons.” On the one hand, this view is logically rigorous, demonstrating the universality and elegance of the theory; on the other hand, it negates the significance of other cognitive levels and subtly redirects their value back to itself.

Turing believed that a general-purpose logical computer could achieve human intelligence through programming; Chomsky believed that the human mind could be reduced to a set of explicit “transformational-generative grammar” rules; Feigenbaum believed that expert systems could be constructed by stacking formal logic and knowledge bases. Driven by this reductionist conviction, generation after generation of scientists flew like moths to a flame into an endlessly unresolvable problem: spending their brief lives writing finite rules in an attempt to fit immensely complex intelligent functions. However, the Kolmogorov complexity of a closed computing system will not spontaneously increase, meaning that finite code and logical rules can never conjure up out of thin air an amount of information and complexity that surpasses their own structure. Infinite theoretical possibilities ultimately still require realization through finite human effort. Therefore, a physicist would never attempt to recreate the Battle of Stalingrad by manipulating quarks; yet, the ontological success of computer scientists long obscured their epistemological primitiveness, rendering their failure inevitable.
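
One way to make this precise, in standard algorithmic-information-theory terms (a gloss added here, not a formalization of the historical argument itself): for any fixed computable transformation f—say, a program of finite length—the complexity of its output is bounded by the complexity of its input plus a constant.

```latex
% Kolmogorov-complexity bound for a fixed computable transformation f:
% the constant c_f depends only on f (its finite description), not on x.
\[
K\bigl(f(x)\bigr) \;\le\; K(x) + c_f
\]
% A closed system of finite rules therefore cannot generate algorithmic
% information beyond what its program and input already contain.
```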

Ultimately, these convictions shattered alongside the final efforts of Japan’s Fifth Generation Computer Project, but they also left behind some world-changing byproducts: Turing’s theory is the cornerstone of today’s information society, and the general-purpose digital logic computer is a tool humanity can no longer live without; Chomsky’s theory, though unable to explain the human mind, forged the branch of “Formal Languages and Automata,” becoming the foundation of modern compilers; Feigenbaum won the Turing Award in 1994 for his contributions to early artificial intelligence systems.

II

In the field of computer architecture, computational reductionism long drove a reactionary force that suppressed the birth of new architectures. Under this mindset, the CPU—designed specifically for basic mathematical operations and logical jumps—was the best practice for realizing a general-purpose computer. Since the CPU already possessed complete functionality—guaranteed by Turing—developing any other architecture alongside it was deemed meaningless. Up until the early 21st century, the CPU was almost the sole object of study in this discipline.

The suppression of emerging architectures by this mindset was devastating. Because all computation was ideologically reduced a priori, the value of any architecture distinct from the CPU was preemptively negated, allowed to exist at best only as a “slave” device. You could design a specialized video encoder and attach it to the periphery of the CPU as a second-class citizen to supplement the CPU’s performance. Under quantitative research methods, its value was reflected in the score improvement of a single test: 464.h264ref. Because it only affected one out of an infinite number of programs, it would be deemed inconsequential; because programs evolve, it would be viewed as a temporary workaround.

The CPU’s unique universality was considered the only foolproof strategy for enduring adaptation to program changes. New architectures might conciliatorily pivot to prove that they too possessed Turing completeness, hoping thereby to receive equally serious treatment. However, they would then fall into a concept devised precisely to put such conciliatory thinking on trial: the “Turing tar-pit”—a term coined by the first Turing Award winner, Alan Perlis, referring to a universality that is only theoretically present but practically meaningless.

Under the reductionist viewpoint, deep learning was also long alienated as just a mundane instance among a vast array of programs, viewed as 052.alvinn—essentially no different from 464.h264ref. Therefore, developing a processor chip specifically for program 052 was considered short-sighted, as people believed it was impossible to predict whether tomorrow’s intelligent algorithms would evolve from deep neural networks into support vector machines, rendering the chip obsolete before it was even born. In the face of rapidly rising deep learning, reductionism’s adherence to universality led the entire industry into a paradox: the stronger the momentum of deep learning, the more meaningless it was to research deep learning processors. This stalemate persisted until around 2012, culminating in the bizarre spectacle of Google using some sixteen thousand CPU cores to train a model to recognize cats.

However, the evolution of history has proven that deep learning processors ultimately transcended their subordinate status, developing into a profoundly impactful industry. This is because deep learning is not a simple stacking of existing computational programs, but rather represents another level of human understanding of computation. It possesses an equally profound theoretical foundation and constitutes an entirely different general-purpose computing paradigm:

  • Type I Universality (Mathematical Logic): Theoretically based on the Universal Turing Machine (the Hilbert Entscheidungsproblem), it relies on discrete states, discrete computation, and symbolic representation, aiming to recursively decide all logically definable mathematical propositions. It achieves universality by passively accepting programming and excels at handling the heavy, repetitive logic that can be clearly described by rules.
  • Type II Universality (Deep Learning): Theoretically based on the Universal Approximation Theorem (Hilbert’s Thirteenth Problem), it relies on real-number states, real-number computation, and data distribution probabilities, possessing the ability to approximate any continuous function on a compact subset of Euclidean space (stated formally below). It achieves universality through backpropagation training and can solve intelligent tasks whose rules humans cannot explicitly describe but which can be learned from data.
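
For concreteness, the guarantee behind Type II universality can be stated in its classical single-hidden-layer form (the Cybenko/Hornik-style statement; exact hypotheses vary across the literature):

```latex
% Universal Approximation Theorem (single hidden layer):
% for every continuous f on a compact set K in R^n and every epsilon > 0,
% there exist N, alpha_i, b_i in R and w_i in R^n such that
\[
\sup_{x \in K}\;\Bigl|\, f(x) \;-\; \sum_{i=1}^{N} \alpha_i\,\sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| \;<\; \varepsilon ,
\]
% where sigma is a fixed sigmoidal (more generally, non-polynomial) activation.
```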

Recently, a rather authoritative narrative describing deep learning processors as “Domain-Specific Architectures” (DSAs) has become widely circulated. In my view, the concept of “Domain-Specific Architecture” is a product of compromise between the alienation under old concepts and the reality brought about by new ones. On the one hand, it acknowledges the past mistake of alienating deep learning as “a mundane program,” making room for the development of deep learning processors; on the other hand, it is only willing to correct it to “a mundane domain” so as to maintain the supremacy of the CPU (the Type I general-purpose architecture). Deep learning processors have been around for over a decade; has the new golden age of diverse “Domain-Specific Architectures” arrived as promised? Which viewpoint better explains reality is something every reader can objectively judge for themselves.

Google’s task of training a model to recognize cats is almost impossible for a single CPU to complete in any reasonable timeframe, yet it is a breeze for a deep learning processor. However, metaphysical computational reductionists insist that only the former is general-purpose—they believe the CPU can even support artificial intelligence a century from now. The absurdity in this view is akin to believing a physicist can diagnose and treat diseases. From the perspective of intelligent computing, the CPU is nothing more than a “Turing tar-pit.”

III

Some time ago, I was commissioned by Professor Chen Yunji to help summarize and review this history of scientific development. I was amazed to discover that history is a cycle—I once thought I was walking a lonely path, only to find the footprints of predecessors everywhere.

In my view, today’s deep learning Large Language Models (LLMs) have already achieved an initial form of Artificial General Intelligence (AGI), though this is not yet a broad consensus. Therefore, I predict that, building on LLMs, a third paradigm of general-purpose computing is about to emerge: cognitive intelligence based on natural language. It achieves universality through natural language and context comprehension, excelling at macroscopic planning, common sense, logical reasoning, and interaction. This is currently the only known computational pathway capable of achieving high-level cognitive functions, already performing very close to human intelligence.

Through research into highly efficient, low-cost hardwired designs, a Language Processor that operates directly on natural language tokens can achieve another thousand-fold increase in efficiency compared to deep learning processors (including GPUs). A thousand-fold efficiency increase is no longer just a quantitative change; it unlocks a new level of computation, realizing a silicon-based brain with advanced cognitive capabilities that responds in real time. This level of computational efficiency was previously widely believed to require switching to analog circuits, optical computing, new materials, or other disparate technical routes, but our solution is implemented on standard CMOS processes and has the potential for rapid practical application.

Technically, we have essentially found a viable path and are continuously advancing it, but the real difficulty lies in reversing concepts entrenched over a long period. Just as deep learning processors once had to face a history of alienation, this novel Language Processor must first overcome misunderstandings—and it must do so at a time when the industry’s attention is entirely focused on building massive “Stargates” with astronomical investments. Lacking broad awareness that a “Third Paradigm of General-Purpose Computing” may be taking shape, most system researchers habitually hold up the yardstick of old concepts.

  • Misconception 1: The Language Processor implements only one model and is therefore inconsequential.

    Just as some once thought deep learning was merely program number one in SPEC, today some might think a single LLM is just another mundane model in MLPerf—Llama 3.1 8B, for instance. However, language capability itself is an important form of general intelligence, and its ability and value in solving general problems need no further elaboration. Exactly which specific model is used to achieve this language capability has become a secondary technical issue, akin to “choosing which vendor supplies the CPU.”

  • Misconception 2: It is impossible to predict whether tomorrow’s model architecture will evolve from Transformer to Mamba, rendering the chip obsolete before it is born.

    In both eras, detractors claimed that current algorithms were in a period of violent turbulence, making it dangerous to bet on a specific algorithm. However, by summarizing the fundamental rules of operations, deep learning processors were able to achieve broad support for deep learning algorithms. They did not perish in the architectural shifts from AlexNet to ResNet to ViT, and running support vector machines on a deep learning processor was never a difficult task to begin with. The Language Processor does not haphazardly bake a single model into hardware; instead, it extracts the crucial computational patterns for targeted design, providing hardwired templates for the matrix-vector operations common to the dominant feed-forward networks and projection components, supplemented by controllable auxiliary units that jointly complete the computation (see the sketch below). A “Sea of Neurons” substrate is entirely capable of simultaneously supporting multiple Transformers or other neural network models, and supporting Mamba presents no technical difficulty.
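
    To illustrate the point, here is a minimal, purely hypothetical software sketch (not the actual hardware design): the dominant projection components of a Transformer block and the recurrent update of a Mamba-style state-space layer both reduce to the same matrix-vector primitive, with only small, configurable auxiliary steps differing between them.

    ```python
    # Hypothetical sketch: one hardwired matrix-vector primitive serving both
    # Transformer-style projections and a Mamba-like recurrent update.
    import numpy as np

    def hardwired_matvec(W, x):
        """The fixed primitive a hardwired template would implement."""
        return W @ x

    def transformer_ffn(x, W_up, W_down):
        # projection -> element-wise nonlinearity (auxiliary unit) -> projection
        h = hardwired_matvec(W_up, x)
        h = np.maximum(h, 0.0)  # ReLU handled by a small auxiliary unit
        return hardwired_matvec(W_down, h)

    def mamba_like_step(x_t, h_prev, A, B, C):
        # state-space-style recurrence: still nothing but matrix-vector products
        h_t = hardwired_matvec(A, h_prev) + hardwired_matvec(B, x_t)
        y_t = hardwired_matvec(C, h_t)
        return y_t, h_t

    # toy shapes only, to show both paths share the same primitive
    d, d_ff, d_state = 8, 16, 4
    x = np.random.randn(d)
    print(transformer_ffn(x, np.random.randn(d_ff, d), np.random.randn(d, d_ff)).shape)
    y, h = mamba_like_step(x, np.zeros(d_state),
                           np.random.randn(d_state, d_state),
                           np.random.randn(d_state, d),
                           np.random.randn(d, d_state))
    print(y.shape, h.shape)
    ```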

  • Misconception 3: Because models update rapidly, the Language Processor will quickly become obsolete.

    This is currently the hardest prejudice to break in the industry. Faced with the ever-changing development of LLMs, people instinctively retreat to an architecture with absolute flexibility to find a sense of security. In the past, this safe harbor was the CPU; today, this safe harbor has become the GPU, implying that “the stronger the momentum of LLM development, the more meaningless it is to research Language Processors.”

    To adapt to rapid development, GPUs have built a massive and heavy software stack (driver - runtime - programming language - operator library - framework - inference system). The astronomical cost of maintaining this entire software ecosystem is currently the GPU’s strongest monopoly barrier. In the era of LLMs, however, this entire expensive software ecosystem ultimately serves only the centralized deployment of a handful of models like GPT, Gemini, and DeepSeek, resulting in very low resource allocation efficiency.

    In a Language Processor, the hardware operates directly at the token level, eliminating the cost of the software ecosystem. Under current technical solutions, hardware tooling costs can also be kept well under control. The cost of re-spinning the hardware in tandem with a model update can be made significantly lower than the cost of training the new model itself. “Achieving model updates by replacing hardware” has never historically appeared as a realistic option, so it will take more time to gain widespread acceptance.

    On the other hand, assuming Moore’s Law accelerates and integrated circuit performance doubles every few months, a CPU’s process technology would no longer be cutting-edge before it even hits the market. Under such circumstances, would we stop manufacturing CPUs altogether? Please note that updating a Language Processor will not change its natural language interface; it will only improve its performance on certain tasks. The only thing capable of obsoleting a Language Processor through performance is another Language Processor.

IV

Turing spent his entire life debating his conviction that artificial machines would eventually achieve human intelligence. He tirelessly wrote articles refuting, one by one, arguments from theology, animism, “strawberries and cream,” telepathy, and other views that seem utterly absurd today. He optimistically wrote: “I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.”

Turing himself once firmly believed that universality already implied that logical computers could attain human intelligence, but in his later years, he also realized the difficulty of simply reducing human intelligence to writing programs on a computer. He wrote something to the effect of: “A general-purpose logical computer is just an extremely disciplined form; to make a machine produce human intelligence, one should build an unorganized machine and then educate it through a system of rewards and punishments.” This new machine form bears a striking resemblance to today’s deep learning. However, because it provided no direct help to the quantitative metric races of CPUs studied in computer architecture, his viewpoint was only selectively absorbed—specifically, the first half—by today’s computer science education system. This shows that in a young discipline like computer science, the knowledge we were previously taught is not necessarily the entire truth. Innovation means constantly re-hammering every established consensus.

Although Turing did not directly realize artificial intelligence, without the glorious development of general-purpose logical computers as tools in the past, there would be no birth and renaissance of deep learning, let alone the prototype of Artificial General Intelligence we are fortunate enough to witness today. After nearly a century of low-level, microscopic stacking, computer science has finally, via LLMs, truly touched the macroscopic “cognitive” level. Following the advent of Language Processors, our creations no longer merely compute numbers, but begin to think deeply through the medium of language. This marks an important turning point in the evolution of the entire discipline from the “engineering” of computers toward “cognitive science.”

History has handed this important turning point to our generation of computer architects. Under existing concepts, the Language Processor architecture seems so incomprehensible; yet, when re-examined through a shifted conceptual lens, it is so intuitive. Hopefully, ten years from now, when people look back at this architecture, they will all feel that it is nothing but a matter of course.