NVIDIA and Alibaba re-evaluate AI, throw FLOPS "into the trash heap"

On March 17, Huang Renxun spoke for more than two hours on stage at NVIDIA GTC 2026, wearing his signature leather jacket. After the event, almost the entire internet was saying "NVIDIA wants to be the king of Tokens." But if you listen carefully to the speech, you'll find that Huang's real focus was not the Token itself but Tokens per Watt. He made this the centerpiece when presenting the inference performance charts, stating clearly that every data center and every AI factory is fundamentally limited by electricity: a 1 GW factory will never become a 2 GW factory, because that is physics. **At fixed power, whoever produces the most Tokens per Watt has the lowest production cost and the steepest revenue curve.**

This sentence is the real core of GTC 2026. Public discussion is fixated on how much more powerful Vera Rubin is than Blackwell, on Groq LPX boosting inference speed by 35x, and on NVIDIA moving data centers to space. These are important, but they all boil down to the same logic: maximizing intelligent output per watt under energy constraints.

When Huang made Tokens/W the core metric for AI factory output, he was pointing to a deeper industry shift: the yardstick for compute competition is moving from chips to systems, from peak parameters to end-to-end energy efficiency, from whose chip is faster to who converts energy into intelligence most efficiently. Given today's product and technology matrix, **NVIDIA and Huang Renxun are still constrained by Token/W, and they are many steps away from truly being the king of Tokens.** This is a shift in the "language of measuring intelligence," and the industry perspective it opens up is far more worth discussing than any new chip.

Coincidentally, the day before GTC officially opened, Alibaba announced the formation of Alibaba Token Hub, led personally by Wu Yongming. Alibaba's core AI organization is named not after AI but after Token, which elevates Token to the level of its AI strategy.
This also shows that viewing AI from a systems perspective is gradually becoming a new industry mindset, which is exactly the philosophy this article emphasizes and the reason it was written.

01 GTC 2026's most important shift isn't about chips themselves

At GTC 2026, everyone's attention was still on new products and terms such as Vera Rubin, Rubin POD, LPX, and DSX AI Factory. But if you look at these launches together, you'll see that they expand the narrative of compute competition from individual chips to compute infrastructure: an entire AI factory composed of computing, networking, storage, power, cooling, control systems, and software. Rubin is described as a POD-scale platform in which multiple racks form one massive coherent system; DSX is defined as a reference design for AI factories that aims to maximize Tokens per Watt.

That means real industry competition will move from a single chip's compute capability to the overall strength of the compute system. More specifically, it is about whether the whole system can efficiently organize limited power, cooling, and network resources into stable AI output. The specific metric is Tokens per Watt (Token/W). This article explores the significance of Token/W as a metric and the opportunities it brings for developing AI infrastructure.

02 Once the competition shifts to systems, the metrics can't stay at the chip level

Everyone is familiar with the chip-era metrics: peak compute (FLOPS), memory bandwidth, FLOPS/W, TOPS/W, and bit/J. They matter because they describe the limits of a component's capability. But this leads to an awkward reality: AI compute centers have no objective, unified, universal metric. Data centers are typically measured in MW (a unit of power), while in China AI compute centers are sized in PFLOPS (based on FP16) as the unit of compute. Yet clusters with the same nominal compute or power can differ greatly in efficiency, depending on their chips, networks, and cooling. The reason isn't complex: past metrics only measured some of the dimensions.
Peak compute describes how much a chip can theoretically compute; bit/J describes the efficiency of local data movement; bandwidth describes the capacity of a subsystem's information channels. All of these are chip-level measurements. But a whole AI system must answer a different question: within fixed power, cooling, and machine-room constraints, how much effective AI output can it deliver? Chip specs alone cannot answer that.

NVIDIA's language at this GTC kept returning to **token cost, per-watt throughput, per-watt token performance, and the number of Tokens per watt.** **The measurement language is shifting from component language to system language.** So if the chip-level metrics are peak compute, bandwidth, and bit/J, then the more reasonable system-level metric is Token/W. The former measures component capability, the latter measures total output. The former is a local optimum, the latter is a system optimum.

03 Token/W connects the chain from energy to intelligent output

NVIDIA's written materials for GTC 2026 call tokens the basic unit of modern AI. This is spot-on: for large language models, inference services, and Agent systems, what the user ultimately pays for is essentially the system's capability to generate and process tokens. From a business operations angle, the token has three advantages: 1) it is directly coupled with the model's inference process; 2) it is directly coupled with revenue models; 3) it covers the new inference-time workloads. Agents, multi-turn dialogues, long contexts, retrieval augmentation, tool calling, and reasoning chains are hard to describe with FLOPS alone, but all of them leave traces in the token, latency, and goodput dimensions.

More importantly, the bottleneck of today's AI infrastructure increasingly shows up directly as an energy constraint. The IEA's "Energy and AI" report forecasts that by 2030, global data center power consumption will rise to about 945 TWh, a huge increase; AI is a major driver, and the US accounts for a large share.
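
To make the unit concrete: a watt is one joule per second, so tokens per second divided by watts is simply tokens per joule. The sketch below is a minimal illustration of that arithmetic, not NVIDIA's methodology; the function names and all numbers are hypothetical and chosen only to show how two clusters with the same facility power can differ in Token/W.

```python
# Illustrative sketch only: names and numbers are hypothetical,
# not measurements and not NVIDIA's published methodology.

def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Token/W: throughput (tokens/s) divided by power (W = J/s) gives tokens per joule."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def tokens_per_day(token_per_watt: float, power_budget_watts: float) -> float:
    """Tokens a facility can produce in 24 hours at a fixed power budget."""
    seconds_per_day = 24 * 60 * 60
    return token_per_watt * power_budget_watts * seconds_per_day


# Two hypothetical clusters drawing the same facility power (1 MW).
cluster_a = tokens_per_joule(tokens_per_second=2_000_000, power_watts=1_000_000)
cluster_b = tokens_per_joule(tokens_per_second=3_000_000, power_watts=1_000_000)

print(cluster_a)  # 2.0 tokens per joule
print(cluster_b)  # 3.0 tokens per joule
# Same power budget, 1.5x the daily token output.
print(tokens_per_day(cluster_b, 1_000_000) / tokens_per_day(cluster_a, 1_000_000))  # 1.5
```

Because the power budget is fixed, the ratio of daily output equals the ratio of Token/W, which is the sense in which the metric ties directly to production cost at a given site.
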
Many of the AI industry's upcoming challenges, though they look chip-related, are essentially questions of electricity, cooling, and infrastructure organization. The value of Token/W is that it connects the core chain: power goes in, passes through computation, networking, storage, scheduling, and cooling, and comes out as tokens. So Token/W isn't just a replacement for FLOPS/W or bit/J; it adds a previously neglected perspective: how much of the energy it consumes does an AI system really convert into intelligent output?

I believe the most worthwhile point of discussion at this GTC is exactly this: **we can't look at chips in isolation; we must view them within systems, and systems within industry constraints.** This is the perspective the author has always advocated: when assessing AI chips, we can't look only at peak compute, memory bandwidth, capacity, and interface specs; we must also see how they collaborate over the network, how they are deployed in racks, how they obtain power on a campus, what cost structure they create for customers, and how much real output they ultimately deliver to the business side.

In a sense, GTC 2026 publicly validated this system perspective. When NVIDIA itself moves its narrative to AI factories, the industry is already shifting from chip-centrism to compute-system-centrism. This matters a great deal. In their early stages, many industries are obsessed with component specs because those are the easiest to measure and promote. But once an industry reaches large-scale deployment, what really determines success is the ability to organize systems. Today's AI infrastructure has reached that stage.

04 Pushing Token/W further, optical interconnects rise in importance

Once the metric system shifts to the system level, many links that used to be considered ancillary rise in status. Optical interconnects are the most typical example.
Previously, optical interconnects were discussed from the module, communications, or device perspective: higher bandwidth, longer reach, lower pJ/bit, better bandwidth density, lower insertion loss. All of these are valid, but they are still component or subsystem language. Within the Token/W framework, the value of optical interconnects is clearer: they lower the energy cost of data movement and strengthen the system's ability to convert power into tokens at scale. In the material on NVIDIA's optical networking products, photonics-based CPO (co-packaged optics) is said to achieve up to 5x better energy efficiency than optical modules, lower latency, and enable AI factories to scale further. The focus is not just more advanced links, but larger systems and higher system efficiency.

From the standpoint of industry logic, this is easy to understand. As models grow, contexts lengthen, and clusters expand, much of a system's power is drawn not by the compute units but by data movement across chips, boards, cabinets, and PODs. At this stage, raising Token/W can't rely on stronger GPUs alone; more efficient interconnects are needed. So from the Token/W perspective, developing optical interconnects isn't just about being cutting-edge; it is becoming a necessary way to save energy in large-scale AI systems.

05 Optical computing is more cutting-edge than optical interconnects, but the logic now holds

It must be admitted that optical computing is at an earlier stage than optical interconnects. Generality, precision, compilers, manufacturing consistency, and system integration are all still evolving. But from the system perspective, its industry relevance is clearer than before. The reason is that Token/W is about end-to-end energy efficiency. Whoever can noticeably reduce energy consumption on certain high-frequency, high-density, repeatable compute paths can improve token output at the system level. This logic doesn't require optical computing to replace every GPU function or to immediately become the universal compute base.
It requires just one thing: on key workloads, bring J/token down and raise token output within a fixed power budget. That's why the narrative around optical computing should shift from single-point component efficiency to system-level energy saving. If the industry only cares about TOPS/W and MAC/J, optical computing sounds like a lab story; once it starts focusing on Token/W, optical computing can become part of the infrastructure conversation. This shift is especially vital for optical computing, because it finally has a top-level language in which to talk to customers, campuses, power suppliers, and capital expenditure.

06 When compute metrics move from chips to systems, optical interconnects and computing enter the industry mainstream

While compute competition stays at the chip level, optical interconnects are just I/O technology and optical computing is just an exploratory device. But as competition moves to large-scale, system-level AI infrastructure, things change. System efficiency increasingly depends on the energy consumption of dense compute, data movement, context management, cross-node collaboration, and power and thermal organization, all places where optics can shine. **From the Token/W perspective, optical interconnects address the transport energy cost per token, while optical computing tries to rewrite the computation energy cost per token. Together, they affect the token output efficiency of the whole system.** That is their fundamental industry significance.

More realistically still, chip supply is not the only constraint that future data centers and AI factories face: there are also grid access, machine-room cooling, campus energy consumption, cabinet power density, and commissioning speed. The IEA's view of AI's energy consumption and NVIDIA's AI factory narrative point to the same trend: AI infrastructure is becoming a systems engineering project measured in energy.
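
The per-token energy argument can be sketched as a simple budget. What follows is a toy model under stated assumptions, not a measurement: it splits J/token into compute, data-movement, and overhead terms, and the split, the class and function names, and every number are hypothetical. Applying the "up to 5x" link-efficiency figure to the whole movement term is also a simplifying assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnergyPerToken:
    """Hypothetical split of joules per token; values are illustrative only."""
    compute_j: float    # energy spent in the compute units
    movement_j: float   # energy spent moving data between chips, boards, racks
    overhead_j: float   # cooling, power conversion, other facility overhead

    @property
    def total_j(self) -> float:
        return self.compute_j + self.movement_j + self.overhead_j


def tokens_per_second(power_budget_watts: float, energy: EnergyPerToken) -> float:
    """At fixed power (J/s), throughput is power divided by joules per token."""
    return power_budget_watts / energy.total_j


baseline = EnergyPerToken(compute_j=0.5, movement_j=0.25, overhead_j=0.25)
# Assumption: data-movement energy falls 5x, standing in for a more efficient link.
better_links = replace(baseline, movement_j=baseline.movement_j / 5)

power = 1_000_000.0  # hypothetical 1 MW budget
gain = tokens_per_second(power, better_links) / tokens_per_second(power, baseline)
print(f"{gain:.2f}x tokens at the same power")  # 1.25x tokens at the same power
```

Under this made-up split, a 5x improvement in one component yields a 1.25x improvement in system output, which illustrates why a component metric such as pJ/bit and a system metric such as Token/W can tell different stories.
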
Against this backdrop, optical interconnects and optical computing address two problems that are increasingly expensive and difficult to optimize with traditional electrical approaches: the energy cost of data movement and the per-unit energy consumption of high-density compute. Underlying this is a more complete form of systems thinking. This is also why GTC 2026 again emphasized its photonics and silicon photonics product lines: when compute metrics move from chips to systems, optics shifts from an advanced technology option to industry infrastructure worth building. From this perspective, CPO and optical computing systems have a very promising future.

Final Words: The Main Axis of AGI's Advancement

The author has always advocated objective, measurable compute metrics and has consistently used Tokens/W to benchmark different compute chips. Looking back at the history of technology: once the ratio of energy output to weight of internal combustion engines rose high enough, cars could be born, planes could fly, and rockets could launch. **In the AI era, when the ratio of output (now tokens) to consumed energy rises high enough, intelligence will get much smarter, and AGI may be born.**

The real takeaway from GTC 2026 isn't the glory or disgrace of NVIDIA, or whether Huang Renxun becomes the "Token King," but the clarity of the new metrics for the AI era. Furthermore, **NVIDIA, Alibaba, and perhaps other industry giants have already begun to realize the need to take a systems-thinking perspective on the development of the AI industry.** This aligns with the main axis of human civilization: collecting, transmitting, and processing more information with less energy. AGI will be no exception!

Source: Tencent Technology