Summary of Hot Chips 2025 Key Points: Google TPU Performance Surge, Meta's Computing Power Investments, Optical Modules, Ethernet Driving Scale Up, and More

```

The growth in AI demand is far from slowing down, and multiple technological breakthroughs are reshaping the industry landscape.

In a research report published on September 3, JPMorgan Chase said that, after attending the Hot Chips 2025 conference, its analysts believe the explosive growth of AI on both the consumer and enterprise sides will continue to drive a strong multi-year demand cycle for advanced computing, memory, and networking technologies.

According to the report, every session at the conference emphasized that AI is the most important driving force behind technological progress and product demand. The core message conveyed is: the momentum of AI infrastructure demand growth remains strong and is expanding from simple computing power competition to comprehensive upgrades in networking and optical technologies. The bank believes the following key trends are worth noting:

Google’s Ironwood TPU has made a significant leap in performance, rapidly narrowing the gap with Nvidia GPUs;

Meta is scaling its clusters beyond 100,000 GPUs and expects cluster size to grow 10-fold over the next decade;

Networking technology has become a key growth point for AI infrastructure, with Ethernet expanding into the Scale Up domain;

Optical integration technology is developing rapidly to address power consumption constraints.

Google Ironwood TPU: Performance Leap Narrows Gap with GPUs

JPMorgan notes that Google unveiled the latest details of the Ironwood TPU (its seventh-generation TPU) at the conference, demonstrating impressive performance improvements. Compared with the TPU v5p, Ironwood's peak FLOPS have increased by about 10 times, and its efficiency by 5.6 times.

Memory capacity and bandwidth have also improved greatly: Ironwood is equipped with 192GB of HBM3E memory and 7.3TB/s of bandwidth, significantly higher than the TPU v5p's 96GB of HBM2 and 2.8TB/s.

An Ironwood supercluster scales to 9,216 chips (up sharply from the previous 4,096), arranged as 144 racks of 64 chips each, for a total of 1.77PB of directly addressable HBM memory and 42.5 exaflops of FP8 computing power.

The performance comparison shows that Ironwood's efficiency of 4.2 TFLOPS/watt is only slightly below the 4.5 TFLOPS/watt of Nvidia's B200/B300 GPUs. JPMorgan states:

This data highlights that advanced AI-specific chips are rapidly narrowing the performance gap with leading GPUs, driving hyperscale cloud service providers to increase investment in custom ASIC projects.

According to JPMorgan’s forecast, this chip, developed in partnership with Broadcom using a 3-nanometer process, will enter mass production in the second half of 2025. Ironwood is expected to bring $9 billion in revenue to Broadcom in the next 6-7 months, with lifetime total revenue exceeding $15 billion.

Meta’s Custom Deployments Highlight MGX Architecture Advantages

The report notes that Meta gave a detailed introduction to the architecture design of its custom NVL72 system, Catalina. Unlike Nvidia’s standard NVL72 reference design, Catalina is distributed across two IT racks and comes with four auxiliary cooling racks.

In terms of internal configuration, each B200 GPU is paired with a Grace CPU, instead of the standard 2 B200s per Grace CPU configuration. This design doubles the number of Grace CPUs in the system to 72, boosts LPDDR memory from 17.3TB to 34.6TB, and increases total cache-coherent memory from 30TB to 48TB—a 60% increase.

Meta says that choosing a custom NVL72 design is mainly based on model requirements and physical infrastructure considerations. Model requirements cover not only large language models but also ranking and recommendation engines. On the infrastructure side, these power-hungry systems need to be deployed in traditional data center environments.

Meta emphasized that Nvidia’s MGX modular reference design, which meets OCP standards, allows customers to customize based on personalized architecture requirements.

Network Technology Takes Center Stage, Scale Up Brings New Opportunities

Networking technology became a key topic at the conference, with significant growth opportunities emerging in both Scale Up and Scale Out domains.

Broadcom highlighted its newly launched 51.2Tb/s Tomahawk Ultra switch, describing it as “a low-latency Scale Up switch purpose-built for HPC and AI applications.”

The Tomahawk Ultra follows Broadcom’s 102.4Tb/s Tomahawk 6 switch and supports the company's strategy to promote Ethernet adoption in both Scale Up and Scale Out fields.

Analysts noted that Scale Up represents a major opportunity for Broadcom’s TAM expansion, especially as hyperscale cloud service providers deploy ever larger XPU clusters.

Nvidia is continuing its Ethernet strategy, launching “Spectrum-XGS” Ethernet technology aimed at the "scale-across" opportunity that arises as customers run distributed clusters across multiple data centers.

Nvidia states that Spectrum-XGS has a number of advantages over existing Ethernet solutions, including unlimited scaling and automatic load balancing, and announced that CoreWeave is the first customer to deploy this technology.

Deep Optical Integration to Address Power and Cost Challenges

Optical technology was another focus of the conference. Multiple speakers emphasized the key drivers pushing for deep integration of optical technology into AI infrastructure, including the limitations of copper interconnects, rapidly rising rack power density, and the relatively high cost and power consumption of optical transceivers.

Lightmatter showcased its Passage M1000 “AI 3D photonic interposer,” which targets the problem that I/O connections placed along a chip's edges scale less efficiently than chip performance. The core of the M1000 is an active multi-reticle photonic interposer spanning 4000mm², capable of creating large chip complexes within a single package. Its optical waveguides are distributed across the chip surface, eliminating the "coastline" limitation of traditional designs while consuming significantly less power than electrical signaling.

Ayar Labs discussed its TeraPHY optical I/O chip for AI Scale Up, the first implementation of the UCIe optical retimer, which ensures compatibility and interoperability with chips from other manufacturers. This technology supports up to 8.192Tb/s of bidirectional bandwidth, with power efficiency 4-8 times better than traditional pluggable optics plus electrical SerDes.

Although CPO and other leading photonics technologies are not yet widely deployed, analysts expect data center power constraints to become a key driver of widespread adoption in 2027-2028.

AMD Product Line Expansion, MI400 Series Coming in 2026

AMD gave an in-depth introduction to the technical details of its MI350 GPU series at the conference. The MI355X runs at a higher total board power (TBP) and maximum clock frequency, 1.4kW and 2.4GHz, compared with 1.0kW and 2.2GHz for the MI350X.

Thus, MI355X is mainly deployed in liquid-cooled data center infrastructures, while MI350X primarily serves customers with traditional air-cooled setups.

In terms of performance, the MI355X offers 9% higher compute performance compared to the MI350X, but memory capacity and bandwidth per chip remain unchanged.

For deployment, the MI355X can be installed in rack systems with up to 128 GPUs, while the MI350X rack supports up to 64 GPUs—mainly due to the different thermal management capabilities of air vs. liquid cooling. However, both maintain a Scale Up domain of 8 GPUs.

AMD reiterated that the MI400 series and its "Helios" rack solution will be launched as scheduled in 2026, with JPMorgan expecting a launch in the second half of 2026, and the MI500 series planned for release in 2027.

JPMorgan analysts believe that AMD is well positioned in the inference compute market, where demand growth exceeds that of the training market, and that its products offer compelling performance and total cost of ownership advantages over Nvidia alternatives.

This article is from the WeChat Official Account "Hard AI".

```
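
The cluster and memory figures quoted in the article can be cross-checked with a few lines of arithmetic. The sketch below is only an illustrative consistency check: it uses the numbers as reported above, assumes decimal units (1 PB = 1,000,000 GB), and the function names are hypothetical. The per-chip FP8 figure is derived here and is not a number quoted from the report.

```python
# Consistency check on figures quoted in the article (not vendor-verified data).

def ironwood_pod(racks=144, chips_per_rack=64, hbm_gb=192, pod_exaflops=42.5):
    """Derive pod-level totals from the per-rack and per-chip figures."""
    chips = racks * chips_per_rack
    hbm_pb = chips * hbm_gb / 1_000_000          # GB -> PB, decimal units assumed
    per_chip_pflops = pod_exaflops * 1000 / chips  # exaflops -> petaflops per chip
    return chips, hbm_pb, per_chip_pflops


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100 * (new - old) / old


chips, hbm_pb, per_chip_pflops = ironwood_pod()

# Meta Catalina vs. the standard NVL72 configuration, as quoted in the article.
lpddr_ratio = 34.6 / 17.3                   # LPDDR capacity, TB
coherent_growth = percent_increase(30, 48)  # cache-coherent memory, TB

if __name__ == "__main__":
    print(f"Ironwood chips per pod: {chips}")
    print(f"Ironwood pod HBM: {hbm_pb:.2f} PB")
    print(f"Derived FP8 per chip: {per_chip_pflops:.2f} PFLOPS")
    print(f"Catalina LPDDR ratio: {lpddr_ratio:.1f}x")
    print(f"Catalina coherent memory growth: {coherent_growth:.0f}%")
```

Under these assumptions the quoted totals line up: 144 racks of 64 chips gives 9,216 chips, 9,216 chips at 192GB each gives about 1.77PB, and the move from 30TB to 48TB of cache-coherent memory is a 60% increase.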