SemiAnalysis detailed analysis of Nvidia's new chip "Rubin CPX": Completely revolutionizing inference architecture and reshaping the industry roadmap


```

With the advent of the "Inference Era" of large AI models, NVIDIA has recently launched the Rubin CPX GPU. Think tank SemiAnalysis believes this GPU may fundamentally change the inference field and ranks its release second in significance only to the GB200 NVL72 rack unveiled in March 2024.

Recently, Citi published a notable research report, stating that NVIDIA’s Rubin CPX GPU, unveiled at the AI Infrastructure Summit, is designed for long-context inference and is expected to achieve an astounding investment return of approximately 50x, far surpassing the ~10x return of the previous GB200 NVL72.

This release is not just a leap for NVIDIA itself, but also a reshaping of the entire industry roadmap. As highlighted in the SemiAnalysis report, the launch of Rubin CPX is second in importance only to the March 2024 release of the GB200 NVL72 Oberon rack-level configuration. The chip changes disaggregated inference serving by optimizing specifically for the prefill stage, emphasizing compute FLOPS over memory bandwidth.

This release will also force all of NVIDIA’s competitors to redraw their roadmaps. AMD and ASIC suppliers have already invested heavily to catch up with NVIDIA’s rack-level solutions, but must now invest again to develop their own prefill chips, further delaying their efforts to close the gap with NVIDIA.

SemiAnalysis’s report provides detailed insights into Rubin CPX, revealing how this chip, by optimizing different stages of inference, is reshaping the industry roadmap. Main points from the report:

Breaking the Memory Wall: Specialized Chip Architecture Design

According to SemiAnalysis, the core concept behind NVIDIA’s Rubin CPX is to decouple the inference process into “prefill” and “decode” stages and design hardware specialized for each.

The report notes that the prefill stage (generating the first token) for LLM requests is typically compute (FLOPS) intensive but makes low use of memory bandwidth.

Although HBM is highly valuable for both training and inference, how well it is utilized varies significantly across the stages of inference. Its bandwidth is fully exploited only during the decode step, so using chips equipped with expensive HBM for prefill is a waste of resources.

Rubin CPX was developed exactly to address this pain point, “slimming down” memory bandwidth and instead emphasizing compute FLOPS. Rubin CPX features 20 PFLOPS of FP4 compute, with only 2TB/s memory bandwidth and 128GB GDDR7 memory. By contrast, the dual-chip R200 offers 33.3 PFLOPS FP4, 20.5TB/s memory bandwidth, and 288GB HBM.

This will bring significant cost benefits: SemiAnalysis points out that switching from HBM to the cheaper GDDR7 memory can lower cost per GB by over 50%. This means that in the prefill stage, Rubin CPX can offer efficient compute capabilities at a far lower cost than R200, dramatically reducing TCO (Total Cost of Ownership).

SemiAnalysis notes the chip design is similar to the next-gen RTX 5090 or RTX PRO 6000 Blackwell, employing a large monolithic chip and a 512-bit wide GDDR7 memory interface. However, while consumer Blackwell GPU chips deliver only about 20% of the HBM version's FLOPS, Rubin CPX delivers about 60%, because it is a standalone die design closer to the R200 compute chip.

Brand-new Rack-level Architecture: Three Deployment Options

NVIDIA has launched three Vera Rubin rack configurations: VR200 NVL144 (Rubin only), VR200 NVL144 CPX (hybrid Rubin + Rubin CPX), and the Vera Rubin CPX dual-rack solution, as follows:

NVL144 CPX Rack: NVIDIA introduced the VR NVL144 CPX (Vera Rubin NVL144 CPX) rack, integrating Rubin GPU and Rubin CPX GPU. Each compute tray will contain 4 R200 GPUs (for decoding) and 8 Rubin CPX GPUs (for prefill). This heterogeneous config allows the system to efficiently handle both stages of inference simultaneously.

Dual-rack Solution: The Vera Rubin CPX dual-rack solution provides greater flexibility, allowing clients to deploy VR NVL144 (all Rubin GPU) and VR CPX (all Rubin CPX GPU) racks separately as needed, precisely adjusting the prefill/decode ratio (PD ratio).

SemiAnalysis offers a detailed analysis of the advances in cable-free design. Because the high-density design leaves no room for cables, NVIDIA uses PCB midplanes and Amphenol Paladin board-to-board connectors for signaling. The CX-9 network card has been moved from the back half of the chassis to the front, so the higher-speed 200G Ethernet/InfiniBand signals travel a shorter distance while the lower-speed PCIe Gen6 signals cover the longer run, improving reliability and maintainability.

Liquid cooling adopts a sandwich design, shared between Rubin CPX and CX-9 cards, maximizing GPU density and cooling efficiency within a 1U tray. This approach was similarly used in NVIDIA’s 2009 GTX 295.

Prefill Pipeline Parallelism: Key to Efficient Resource Utilization

Another major advantage of Rubin CPX is optimization for prefill pipeline parallelism.

Lower network costs: Communication demand during prefill is low, so Rubin CPX abandons costly ultrafast scale-up networks (like NVLink). PCIe Gen6 x16 provides about 1Tbit/s bandwidth, sufficient for the prefill of modern MoE LLMs.

Higher throughput: Pipeline parallelism offers higher token throughput per GPU, as it involves simple send/receive operations, not the all-to-all collective ops in expert parallelism (EP).

Significant TCO savings: NVLink scale-up costs about $8,000 per GPU, over 10% of total cluster cost. Rubin CPX avoids these expensive network devices, saving users large costs.

Technical Breakthrough for Disaggregated Inference Services

SemiAnalysis notes that the industry first tried routing prefill and decode requests to different compute units so that the two stages would not interfere with each other. This makes SLAs easier to manage, but the hardware is still “misconfigured” in that pure prefill operations almost always leave memory bandwidth seriously underused.

SemiAnalysis emphasizes that LLM request handling comprises two stages: prefill, which determines TTFT (Time-to-First-Token) and is usually compute-bound, and decode, which determines TPOT (Time-Per-Output-Token) and is always memory-bound.

The analysis shows that when sequence length exceeds 32k, FLOPS utilization reaches 100%, but memory bandwidth usage drops. Running prefill only on an R200 wastes $0.90 per hour of TCO, while Rubin CPX, with much cheaper memory, greatly reduces this waste.

In pipeline-parallel inference, Rubin CPX’s PCIe Gen6 x16 interface offers about 1Tbit/s of one-way bandwidth, enough for prefill on leading modern MoE LLMs. Rubin CPX provides greater memory capacity but uses “lower quality” GDDR7, which costs less than half as much per GB as HBM. For memory suppliers, GDDR7 margins are low because the technical requirements are lower and competition is tougher (Samsung can also supply it).

Will HBM Demand Drop? Will Overall Memory Market Grow?

CPX systems reduce the share of HBM in total system spending. For every dollar spent on VR200 NVL144 CPX or VR CPX racks, a lower percentage goes to HBM compared to VR200 NVL144 racks. Assuming AI system spending is fixed, HBM demand per dollar spent will decline.

Further, the SemiAnalysis report says, while the NVIDIA Rubin CPX architecture lowers memory use, it may actually expand the overall memory market, reshaping the GDDR7 supply chain landscape.

The technical reality is more complex. Rubin CPX lowers prefill and token costs. As token cost drops, demand rises, increasing overall decode needs. As with many other cost-reducing innovations, the increase in demand usually outweighs the cost reduction, expanding the overall market.

Rubin CPX has triggered a surge in GDDR7 demand, reshaping memory supply chains. The effects are emerging: RTX Pro 6000 also uses GDDR7 (though slower at 28Gbps). NVIDIA has already placed massive supply chain orders for RTX Pro SKUs.

Samsung is the biggest beneficiary of this GDDR7 demand wave. It can meet NVIDIA’s sudden high-volume orders, most of which go to Samsung. By contrast, SK Hynix and Micron have been unable to meet such demand because their wafer capacity is tied up with HBM and other businesses.

Competitors Left Far Behind

The SemiAnalysis report says that the introduction of Rubin CPX has widened the gap between NVIDIA’s rack system design capability and that of its competitors to “canyon-sized.”

All of NVIDIA’s competitors may have to once again rework their entire roadmap, just as Oberon changed the industry roadmap. They’ll need to invest again to develop their own prefill chips, further delaying efforts to catch up with NVIDIA.

SemiAnalysis believes Google TPU, with its 3D torus scale-up network and support for up to 9,216 TPUs per cluster, should develop its own dedicated prefill chips to keep its price/performance advantage.

AMD’s catch-up strategy faces major challenges: The MI400 72-GPU rack system was expected to compete with VR200 NVL144 on TCO, but NVIDIA has now raised VR200 memory bandwidth to 20.5TB/s, matching MI400. If MI400’s actual FP4 performance is equal to or lower than that of VR200 NVL144, AMD again falls behind.

According to SemiAnalysis, AMD lacks robust internal workload support, and must develop rack-level systems and refine software, while also opening a new front on prefill chips, to have hope of catching up with NVIDIA by 2027.

Suppliers with internal workloads, such as AWS with Trainium3 and Meta with MTIAv4, have an advantage in developing prefill chips. But AWS faces technical hurdles: 1U tray space is limited, so it may need an EFA NIC sidecar rack and external PCIe AEC cable solutions.

This article is from WeChat Official Account "Hard AI".

Risk Warning and Disclaimer

The market carries risks, and investments require caution. This article does not constitute personal investment advice, nor does it consider individual users' special financial goals, situations, or needs. Users should determine whether any opinions, views, or conclusions in this article fit their own circumstances. Investing accordingly is at one's own risk.
```
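
As a rough cross-check of the figures quoted in the "Breaking the Memory Wall" and "Prefill Pipeline Parallelism" sections, the arithmetic can be sketched as below. This is a minimal illustration, not SemiAnalysis's model: the spec numbers are the ones quoted in the article, the 64 GT/s per-lane rate is the PCIe Gen6 signaling rate, and the names `SPECS`, `flops_per_byte`, and `pcie_gen6_x16_gbps` are hypothetical helpers introduced only for this example.

```python
# Spec figures as quoted in the article (FP4 PFLOPS, memory bandwidth in TB/s).
SPECS = {
    "Rubin CPX": {"pflops_fp4": 20.0, "mem_bw_tbps": 2.0},
    "R200": {"pflops_fp4": 33.3, "mem_bw_tbps": 20.5},
}


def flops_per_byte(spec):
    """FP4 FLOPs available per byte of memory traffic.

    A higher value means the chip is balanced toward compute-heavy work
    such as prefill; a lower value favors bandwidth-heavy work such as decode.
    """
    return (spec["pflops_fp4"] * 1e15) / (spec["mem_bw_tbps"] * 1e12)


def pcie_gen6_x16_gbps(gt_per_lane=64, lanes=16):
    """Raw one-direction PCIe Gen6 x16 rate in Gbit/s.

    Assumption: ignores encoding and protocol overhead, so real
    throughput is somewhat lower.
    """
    return gt_per_lane * lanes


# Share of R200's FP4 compute that a single Rubin CPX provides.
flops_ratio = SPECS["Rubin CPX"]["pflops_fp4"] / SPECS["R200"]["pflops_fp4"]

if __name__ == "__main__":
    for name, spec in SPECS.items():
        print(f"{name}: {flops_per_byte(spec):,.0f} FP4 FLOPs per byte")
    print(f"Rubin CPX / R200 FP4 compute: {flops_ratio:.0%}")
    print(f"PCIe Gen6 x16, one direction: {pcie_gen6_x16_gbps()} Gbit/s")
```

On these numbers Rubin CPX has roughly 10,000 FP4 FLOPs per byte of memory bandwidth against roughly 1,600 for R200, which is the imbalance the article describes as emphasizing compute over bandwidth. The 20 / 33.3 ratio matches the "60%" figure, and 64 GT/s across 16 lanes gives the "about 1Tbit/s" quoted for PCIe Gen6 x16.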