When Engineers Reject Smarter Models: The AI Inference Battle and OpenAI’s New "Weapon"

The AI inference market is undergoing a paradigm shift: speed, rather than intelligence, is becoming the core variable developers are willing to pay for. This reversal in preference has pushed the long-marginal chip company Cerebras into the spotlight and prompted OpenAI to bet tens of billions of dollars on the wafer-scale chipmaker, which is preparing to go public.

According to a deep industry report by SemiAnalysis, OpenAI has signed a master agreement with Cerebras for a total compute scale of up to 750 megawatts, with a potential expansion to 2 gigawatts, corresponding to remaining contractual obligations of $24.6 billion.

The core logic of this deal is: OpenAI’s GPT-5.3-Codex-Spark model can achieve a generation speed of 2000 tokens per user per second on Cerebras hardware, far surpassing the interactive experience provided by HBM-based GPU clusters. At the same time, Cerebras is standing on the threshold of an IPO, and its fate is now deeply tied to OpenAI.

The market signal of this speed revolution is clear. SemiAnalysis reveals that 80% of its team’s AI spending, which peaked at an annualized $10 million, is concentrated on Anthropic’s Opus 4.6 fast mode, which charges a 6x premium for 2.5x the interactive speed. More telling, when Opus 4.7 was released, several engineers on the team refused to upgrade simply because the new version does not support fast mode. This is the first time the SemiAnalysis team has deliberately given up cutting-edge intelligence in favor of faster token generation.

Speed Premium: Developers Vote with Their Wallets

The competitive landscape in the inference market is being redefined along a new axis.

As NVIDIA CEO Jensen Huang emphasized repeatedly at this year’s GTC conference, throughput (tokens per GPU per second) and interactivity (tokens per user per second) are the fundamental tradeoffs in inference—the former serves batch processing, the latter determines user experience. SemiAnalysis compares this to choosing between a “bus and a Ferrari”: you can serve a large number of users slowly, or a single user quickly.
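The bus-versus-Ferrari tradeoff can be illustrated with a toy model. The sketch below assumes each decode step costs a fixed amount plus a small amount per sequence in the batch; both constants are hypothetical values chosen for illustration, not measurements of any real system.

```python
# Toy model of one decode step: a fixed cost (for example, reading weights)
# plus a per-sequence cost. Both constants are hypothetical.
FIXED_STEP_SECONDS = 0.010      # assumed cost paid once per step
PER_SEQUENCE_SECONDS = 0.0005   # assumed extra cost per sequence in the batch


def step_seconds(batch_size: int) -> float:
    """Time for one decode step that emits one token per sequence."""
    return FIXED_STEP_SECONDS + batch_size * PER_SEQUENCE_SECONDS


def interactivity(batch_size: int) -> float:
    """Tokens per user per second: each user gets one token per step."""
    return 1.0 / step_seconds(batch_size)


def throughput(batch_size: int) -> float:
    """Tokens per device per second: the whole batch advances each step."""
    return batch_size / step_seconds(batch_size)


for batch in (1, 8, 64, 256):
    print(f"batch={batch:>3}  per-user={interactivity(batch):7.1f} tok/s  "
          f"per-device={throughput(batch):8.1f} tok/s")
```

Under these assumptions, growing the batch raises tokens per device while lowering tokens per user, which is the tradeoff described above.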

Market preference has been validated by consumer behavior. Opus 4.6 fast mode, offering 2.5x interactive speed at 6x the price, became Anthropic's highest-margin SKU and a key driver of its explosive ARR growth this year. However, data collected by SemiAnalysis in cooperation with OpenRouter show that this mode has recently degraded: standard Opus 4.6 sustains about 40 tokens per second (tps) per user, while fast mode, which previously exceeded 100 tps, has dropped to about 70 tps. The actual speedup has shrunk from 2.5x to about 1.75x.
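The speedup figures follow directly from the quoted token rates. The sketch below reproduces them and adds one derived number, the price multiple paid per unit of speedup; it treats "exceeded 100 tps" as exactly 100 tps, and the helper names are made up for this illustration.

```python
PRICE_MULTIPLE = 6.0   # fast mode price relative to standard (from the article)
STANDARD_TPS = 40.0    # standard Opus 4.6 interactivity (from the article)


def speedup(fast_tps: float, standard_tps: float) -> float:
    """How many times faster fast mode is than standard mode."""
    return fast_tps / standard_tps


def premium_per_speedup(price_multiple: float, speed_multiple: float) -> float:
    """Price multiple paid for each 1x of speed actually delivered."""
    return price_multiple / speed_multiple


# 100 tps stands in for "exceeded 100 tps"; 70 tps is the recent figure.
for label, fast_tps in (("earlier", 100.0), ("recent", 70.0)):
    s = speedup(fast_tps, STANDARD_TPS)
    p = premium_per_speedup(PRICE_MULTIPLE, s)
    print(f"{label}: speedup {s:.2f}x, price per unit of speedup {p:.2f}x")
```

At an unchanged 6x price, the drop from 2.5x to 1.75x means buyers pay noticeably more for each unit of speed they actually receive.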

Both OpenAI and Anthropic have recognized this demand stratification, and through fast modes, priority modes, batch pricing, and other product forms, are attempting to cover the entire market and find the profit-maximizing combination.

Wafer-Scale Chips: The Technology Logic of a High-Stakes Gamble

Cerebras' core bet is to break past the reticle limit, the maximum area a lithography machine can expose in a single shot, by turning the entire wafer into a single chip.

Its third-generation product, WSE-3, is manufactured on TSMC's N5 process, integrating 44GB of SRAM on a single wafer and providing 21PB/s of memory bandwidth, thousands of times higher than HBM. The essence of this architecture is to pair extremely high memory bandwidth with extremely low memory access latency. This lets WSE-3 approach its theoretical compute in small-batch, low-arithmetic-intensity decode scenarios, where HBM-based GPUs often suffer from “compute starvation” because their compute units sit idle waiting on memory.

However, this architecture also brings significant tradeoffs in compute density. SemiAnalysis points out that the actual dense FP16 compute of WSE-3 is only 15.625 PFLOPS, one-eighth of Cerebras’ official claim of 125 PFLOPS, because the official figure assumes 8:1 unstructured sparsity. SemiAnalysis dubs this the “Feldman formula,” likening it to NVIDIA’s “Jensen math” but arguing that Feldman goes further.
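The quoted figures can be combined into a rough roofline-style check. The sketch below derives the dense FLOPS from the sparsity ratio, then computes the "ridge point": the arithmetic intensity (FLOPs per byte moved) above which compute, not memory, becomes the limit. It assumes the full 21PB/s is uniformly usable, which is a simplification, and the HBM accelerator numbers are illustrative placeholders that do not come from the source.

```python
CLAIMED_SPARSE_PFLOPS = 125.0   # Cerebras' official figure (from the article)
SPARSITY_RATIO = 8              # 8:1 unstructured sparsity (from the article)
WSE3_BANDWIDTH_PB_S = 21.0      # on-wafer SRAM bandwidth (from the article)


def dense_pflops(claimed_pflops: float, sparsity_ratio: int) -> float:
    """Dense compute implied by a sparsity-inflated headline number."""
    return claimed_pflops / sparsity_ratio


def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs per byte needed before compute, not memory, is the bottleneck."""
    return peak_flops_per_s / bandwidth_bytes_per_s


dense = dense_pflops(CLAIMED_SPARSE_PFLOPS, SPARSITY_RATIO)
wse3_ridge = ridge_point(dense * 1e15, WSE3_BANDWIDTH_PB_S * 1e15)

# Hypothetical HBM-based accelerator: placeholder values for contrast only.
HBM_PEAK_FLOPS = 2.0e15         # assumed dense FLOP/s
HBM_BANDWIDTH_BYTES = 8.0e12    # assumed bytes/s
hbm_ridge = ridge_point(HBM_PEAK_FLOPS, HBM_BANDWIDTH_BYTES)

print(f"WSE-3 dense compute: {dense} PFLOPS")
print(f"WSE-3 ridge point: {wse3_ridge:.3f} FLOP/byte")
print(f"Placeholder HBM accelerator ridge point: {hbm_ridge:.0f} FLOP/byte")
```

A ridge point below 1 FLOP per byte suggests that even low-arithmetic-intensity decode can keep the wafer's compute busy, whereas a device with a ridge point in the hundreds needs large batches to do the same.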

In terms of system cost, SemiAnalysis estimates the material cost per CS-3 server (including KVSS CPU nodes) at about $450,000, far higher than the roughly $20,000 cost of the TSMC wafer itself. Expensive custom power modules (Vicor), liquid cooling systems, and the custom masks required for each wafer batch together drive up the overall cost structure.

Architectural Weakness: The Geometric Dilemma of Network Bandwidth

The most prominent weakness of WSE-3 is extremely limited off-chip bandwidth.

Each WSE-3 provides only 150GB/s (1.2Tb/s) of off-chip bandwidth, just one-sixth of the 900GB/s of scale-up bandwidth a single NVIDIA Blackwell GPU gets over NVLink5. This restriction is not a design oversight but an inherent constraint of wafer-scale architectures, which SemiAnalysis calls the “island problem.”
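The two bandwidth figures are related by a simple unit conversion and ratio, sketched here for reference. The helper name is made up for this illustration.

```python
WSE3_OFFCHIP_GB_S = 150.0   # gigabytes per second (from the article)
NVLINK5_GB_S = 900.0        # gigabytes per second (from the article)


def gigabytes_to_terabits(gb_per_s: float) -> float:
    """Convert GB/s to Tb/s: 8 bits per byte, 1000 gigabits per terabit."""
    return gb_per_s * 8 / 1000


ratio = WSE3_OFFCHIP_GB_S / NVLINK5_GB_S

print(f"WSE-3 off-chip: {gigabytes_to_terabits(WSE3_OFFCHIP_GB_S):.1f} Tb/s")
print(f"Share of NVLink5 bandwidth: {ratio:.3f}")
```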

The root of the problem lies in the wafer's uniform step-and-repeat exposure mechanism. WSE-3 is assembled from 84 identical exposure units (dies). Each die must be identical to ensure proper cross-die 2D on-chip mesh networking. This means SerDes PHYs cannot be concentrated at the wafer edge—adding I/O bandwidth would require reserving PHY area in every die, but PHYs located inside the wafer cannot connect to the outside, resulting in a lot of "stranded silicon." Moreover, PHY modules create "holes" in the on-chip mesh, increasing data routing latency and diluting the wafer-scale architecture’s core advantage.

This bandwidth bottleneck directly limits Cerebras’ ability to serve large models. For modern workloads with over a trillion parameters and million-token context windows, Cerebras must use pipeline parallelism, slicing the model layerwise across multiple wafers and only transferring activations between wafers. But as the model scales, the wafer count increases linearly, and the fixed inter-wafer transfer latency accumulates, ultimately eroding the speed advantage.
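A back-of-the-envelope estimate shows how quickly the wafer count grows. The sketch below counts only the wafers needed to hold model weights in the 44GB of SRAM per WSE-3, ignoring KV cache and activations, so real deployments would likely need more. The 2 bytes per parameter (FP16) and the per-hop latency are assumptions; the source gives no per-hop figure.

```python
WSE3_SRAM_BYTES = 44 * 10**9   # 44GB of SRAM per wafer (from the article)


def wafers_for_weights(param_count: int, bytes_per_param: int) -> int:
    """Minimum wafers whose combined SRAM holds the weights (ceiling division)."""
    total_bytes = param_count * bytes_per_param
    return -(-total_bytes // WSE3_SRAM_BYTES)


def pipeline_hop_seconds(wafer_count: int, per_hop_seconds: float) -> float:
    """Fixed latency a token accumulates crossing every wafer boundary."""
    return max(wafer_count - 1, 0) * per_hop_seconds


ASSUMED_BYTES_PER_PARAM = 2       # FP16 weights, an assumption
ASSUMED_HOP_SECONDS = 10e-6       # hypothetical placeholder, not from the source

for name, params in (("120B", 120 * 10**9), ("1T", 10**12)):
    wafers = wafers_for_weights(params, ASSUMED_BYTES_PER_PARAM)
    hops = pipeline_hop_seconds(wafers, ASSUMED_HOP_SECONDS)
    print(f"{name}: {wafers} wafers, {hops * 1e6:.0f} microseconds of hop latency per token")
```

Because the hop latency term grows linearly with the wafer count, it sets a floor on per-token latency that no amount of on-wafer bandwidth can remove.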

SRAM Expansion is Dead: Worries in the Roadmap

Another structural challenge for Cerebras is the physical limit of SRAM density scaling.

From WSE-1 (TSMC 16nm, 18GB SRAM) to WSE-2 (7nm, 40GB), SRAM capacity grew about 2.2x in a single generation. But WSE-3’s move from 7nm to 5nm grew SRAM only from 40GB to 44GB, a mere 10% increase, even as logic transistor count rose about 50%. SemiAnalysis data show that TSMC N3E offers virtually no SRAM cell area shrink relative to N5, and N2 and subsequent nodes are similar: SRAM scaling has basically stalled.
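The generation-over-generation slowdown is easy to verify from the capacities quoted above; a minimal sketch:

```python
# SRAM capacity per generation in GB (figures from the article).
SRAM_GB = {"WSE-1": 18, "WSE-2": 40, "WSE-3": 44}


def growth(previous_gb: float, current_gb: float) -> float:
    """Capacity multiple from one generation to the next."""
    return current_gb / previous_gb


generations = list(SRAM_GB.items())
for (prev_name, prev_gb), (cur_name, cur_gb) in zip(generations, generations[1:]):
    print(f"{prev_name} -> {cur_name}: {growth(prev_gb, cur_gb):.2f}x")
```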

This means the only path for Cerebras to increase future SRAM capacity is to sacrifice compute area for storage area within fixed wafer area—a strict zero-sum tradeoff. The next-gen CS-4 system will use the N5-based WSE-3, increasing clock speed and compute via higher power draw, but keeping SRAM capacity unchanged.

By contrast, NVIDIA, following its Groq acquisition, can stack SRAM chips in the Z-direction via hybrid bonding (the LP40 roadmap), circumventing planar scaling limits. Cerebras is exploring similar paths, such as stacking DRAM wafers or photonic interconnect wafers on the WSE via hybrid bonding. SemiAnalysis is cautious about the technical feasibility and timeline, believing that wafer-scale hybrid bonding faces much greater thermo-mechanical stress and bond wave challenges than conventional chips.

The OpenAI Deal: The Double-Edged Sword of a Single Customer

The Cerebras-OpenAI relationship has far exceeded the scope of a typical vendor-customer arrangement.

According to the S-1 filing cited by SemiAnalysis, the two parties signed a master relationship agreement (MRA) in December 2025. OpenAI commits to purchase 750 megawatts of AI inference compute in tranches between 2026 and 2028, with each tranche carrying a contract term of 3–4 years, extendable to 5 years, and holds an option for an additional 1.25 gigawatts. As of December 31, 2025, Cerebras’ remaining contractual obligations stood at $24.6 billion.
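The headline numbers of the agreement can be cross-checked with simple arithmetic. The per-megawatt figure below is a derived estimate that assumes the entire $24.6 billion is attributable to the 750-megawatt commitment; the filing as summarized here does not break it down that way.

```python
COMMITTED_MW = 750              # firm commitment (from the article)
OPTION_MW = 1250                # additional 1.25 gigawatt option (from the article)
REMAINING_OBLIGATIONS_USD = 24.6e9


def total_capacity_gw(committed_mw: float, option_mw: float) -> float:
    """Capacity in gigawatts if the option is fully exercised."""
    return (committed_mw + option_mw) / 1000


def usd_per_mw(obligations_usd: float, committed_mw: float) -> float:
    """Implied contract value per megawatt, under the stated assumption."""
    return obligations_usd / committed_mw


print(f"Full scale: {total_capacity_gw(COMMITTED_MW, OPTION_MW)} GW")
print(f"Implied value: ${usd_per_mw(REMAINING_OBLIGATIONS_USD, COMMITTED_MW) / 1e6:.1f} million per MW")
```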

Structurally, OpenAI plays three roles: it provides Cerebras a $1 billion secured working capital loan (6% annual interest, waived if the loan is repaid through compute delivery); it holds warrants for 33,445,000 Class N (non-voting) common shares, exercisable at a near-zero price; and it may own about 12% of Cerebras on a fully diluted basis. If the MRA is terminated for reasons not attributable to OpenAI, Cerebras must immediately repay the full loan balance plus accrued interest, and OpenAI gains direct control over the funds in the escrow account.

This structure means Cerebras' growth prospects are closely tied to a single customer. SemiAnalysis expects a marked inflection in Cerebras’ revenue in the coming years, with OpenAI as the main growth driver, but execution risk is also concentrated—the number of servers Cerebras must deliver by 2028 will be an order of magnitude greater than its historical cumulative shipments, while datacenter buildout pace is the greatest uncertainty.

Swapping Speed for Intelligence: How Much Is This Deal Worth?

OpenAI’s flagship product on Cerebras, GPT-5.3-Codex-Spark, is not the actual GPT-5.3-Codex but a smaller model based on the gpt-oss-120B architecture and distilled from GPT-5.3-Codex, with a parameter count more than 10x smaller than the original.

SemiAnalysis is blunt: for now, Cerebras chips can economically serve only relatively small models. For workloads with over a trillion parameters and 1 million-token context windows, running on Cerebras would carry a significant cost premium, and actual interaction speed would likely fall below 1,000 tokens per second.

However, there is a key variable: the speed of algorithmic advancement. SemiAnalysis believes 120B-parameter models may reach GPT-5.5-level intelligence in less than a year. At that point, the value proposition of “trading cutting-edge intelligence for ultra-fast tokens” will change, just as engineers today are already willing to forgo Opus 4.7’s added intelligence to keep the interactive experience of Opus 4.6 fast mode.

The initial commitment of 750 megawatts is already locked in. The real question is: when the intelligence of 120B models catches up to today’s frontier, will OpenAI exercise its option and expand the agreement to 2 gigawatts or more? The answer will determine whether Cerebras’ IPO valuation materializes and will shape the outcome of the next phase in the inference war.
