AI Chip "Division of Labor" Moment! Why Are There Two Versions of Google's 8th-Generation TPU?

```

Google has pushed its AI chip strategy into a new phase.

At the Google Cloud Next 2026 conference held in Las Vegas on Wednesday, Google Cloud unveiled two versions of its eighth-generation Tensor Processing Unit (TPU): the TPU 8t, designed specifically for training, and the TPU 8i, optimized for inference. This is the first time Google has split training and inference onto separate chips, signaling a major shift in its AI hardware roadmap.

Both chips are scheduled to become available later in 2026. Compared to the seventh-generation Ironwood TPU launched last November, the TPU 8t delivers 2.8 times the performance at the same cost, and the TPU 8i delivers 80% more. Both chips also more than double performance per watt over the previous generation, with gains of 124% for the TPU 8t and 117% for the TPU 8i.

Amin Vahdat, Google's Senior Vice President and CTO for AI and Infrastructure, stated that with the rise of AI agents, "the industry will benefit from chips optimized specifically for training and inference needs." Alphabet CEO Sundar Pichai also noted in a blog post that the architecture aims to "deliver the massive throughput and low latency required to run millions of agents simultaneously in a cost-effective way."

Why Split into Two Chips

Splitting the eighth-generation TPU in two is Google's direct response to the growing divergence of AI workloads. Pre-training, post-training, and real-time inference now differ significantly in their computational characteristics: training demands extreme throughput and scale, while inference is more sensitive to latency and concurrency. A single chip can hardly be optimized for both at once.

Google's technical blog notes that the design philosophy for the eighth-generation TPU centers on scalability, reliability, and efficiency. The two chips share the core DNA of Google's AI software stack, but each is optimized for a different bottleneck.

Both chips integrate Arm-based Axion CPUs, which removes the host-side bottlenecks caused by data preprocessing delays and keeps the TPU compute units running at full capacity.

TPU 8t: Computing Engine for Large-Scale Training

TPU 8t is positioned as a dedicated accelerator for pre-training and embedding-intensive workloads. Google claims it can "compress the development cycle for cutting-edge models from months to weeks."

In terms of scale, up to 9,600 TPU 8t chips can be combined into a single supercomputer node (superpod), and distributed training can be scaled to clusters of more than one million TPU chips through the JAX and Pathways frameworks.

At the chip level, TPU 8t introduces three key technological innovations.

First is the SparseCore accelerator, which handles the irregular memory access patterns of embedding lookups. It offloads data-dependent global aggregation operations from the Matrix Multiplication Unit (MXU), avoiding the zero-operation bottlenecks common in general-purpose chips.

Second is native FP4 support. Using 4-bit floating point numbers doubles MXU throughput and reduces the power spent on data movement, which lets larger model layers reside in local hardware buffers.

Third is a more balanced scaling of the Vector Processing Unit (VPU), which lets vector operations such as quantization and softmax overlap more smoothly with matrix multiplication in the pipeline, improving overall chip utilization.

At the network level, Google introduces the new Virgo network architecture with TPU 8t. Virgo uses high-radix switches and a flattened, two-layer non-blocking topology, raising data center network (DCN) bandwidth by up to 4 times over the previous generation and doubling inter-chip interconnect (ICI) bandwidth. A single Virgo network can connect over 134,000 TPU 8t chips, providing up to 47 petabits/second of non-blocking bidirectional bandwidth, with total compute power exceeding 1.6 million ExaFlops.

On the storage side, TPU 8t introduces TPUDirect RDMA and TPUDirect Storage, which bypass the host CPU and transfer data directly between TPU high-bandwidth memory (HBM), network cards, and fast storage. Storage access is 10 times faster than on the seventh-generation Ironwood TPU, keeping the MXUs fully loaded when processing massive multimodal datasets.

TPU 8i: Low-latency Expert for High-Concurrency Inference

TPU 8i is designed for post-training and high-concurrency inference scenarios, focused on reducing latency and increasing concurrent processing capability per chip.

On-chip memory is the most prominent hardware feature of TPU 8i. Each chip integrates 384MB of SRAM, three times Ironwood's capacity, so a larger KV cache can reside entirely on-chip. This significantly reduces idle time during long-context decoding, which is especially important for AI tasks that require multi-step reasoning.

TPU 8i also introduces the Collective Acceleration Engine (CAE), which accelerates the reduction and synchronization steps in autoregressive decoding and "chain-of-thought" processing. Each TPU 8i chip contains two Tensor Cores (TC) and one CAE tile, which replaces the four SparseCore units in Ironwood. On-chip collective operation latency is cut to one-fifth, directly boosting the throughput needed to run millions of agents simultaneously.

In network topology, TPU 8i abandons the 3D torus structure used by TPU 8t and instead adopts the new Boardfly interconnect topology. In a 1,024-chip configuration, the 3D torus requires up to 16 hops between any two chips; Boardfly's high-radix design cuts this to 7 hops, reducing network diameter by 56% and improving all-to-all communication latency by up to 50%. This is especially advantageous for the frequent cross-chip token routing in Mixture of Experts (MoE) and reasoning models. Boardfly uses a hierarchical structure, scaling from four-chip building blocks up to a complete Pod of 1,152 chips, with inter-group connections implemented via optical circuit switches (OCS).

Software Ecosystem and Market Significance

Google emphasizes that unlocking the hardware's performance depends on close integration with the supporting software stack.

The eighth-generation TPU carries forward the software stack established with Ironwood. It supports mainstream frameworks such as JAX, PyTorch, Keras, and vLLM, and provides the Pallas custom kernel language for taking full advantage of SparseCore and CAE.

Google also announced that native PyTorch support for TPUs is now in preview, allowing users to migrate existing PyTorch models to run on TPUs without any code modifications.

From a market perspective, Google’s dual-chip strategy directly addresses AI infrastructure cost pressures. Training and inference have markedly different hardware requirements; a unified chip inevitably wastes resources in certain scenarios. Through dedicated optimization, Google achieves greater improvements in price-to-performance, providing cloud customers with more competitive unit computing costs.

Both chips are now part of Google Cloud's AI Hypercomputer architecture, which integrates hardware, software, and networking to cover AI workloads across their full lifecycle.

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account the particular investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article suit their specific circumstances. Investing based on this article is at your own risk.

```
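
As a rough check on the hop counts quoted for Boardfly in the TPU 8i section, the sketch below computes the worst-case hop count of a 3D torus and the resulting reduction in network diameter. The 8 x 8 x 16 torus shape is an assumption (the article gives only the 1,024-chip total, not the dimensions), the function names are illustrative, and the 7-hop Boardfly figure is taken from the article rather than derived.

```python
from math import prod


def torus_diameter(dims):
    """Worst-case hop count in a wraparound torus.

    With wraparound links, the farthest node along an axis of size n
    is floor(n / 2) hops away, and the per-axis distances add up.
    """
    return sum(n // 2 for n in dims)


def reduction_pct(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return 100 * (before - after) / before


# Assumed shape: one way to arrange 1,024 chips as a 3D torus.
TORUS_SHAPE = (8, 8, 16)
# Figure quoted in the article for Boardfly; not derived here.
BOARDFLY_HOPS = 7

assert prod(TORUS_SHAPE) == 1024

torus_hops = torus_diameter(TORUS_SHAPE)
print("3D torus diameter:", torus_hops)
print("Diameter reduction (%):", round(reduction_pct(torus_hops, BOARDFLY_HOPS)))
```

Under this assumed shape the torus diameter comes out to 16 hops, and going from 16 to 7 hops is a reduction of about 56%, which is consistent with the figures in the article.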