SemiAnalysis Exclusive Breakdown: Full Details of the Blackwell Architecture, Nvidia's Never-Before-Revealed Secrets

Nvidia's Blackwell GPU represents one of the most significant GPU microarchitecture changes in recent years, but so far lacks a comprehensive official white paper.

Renowned semiconductor research institute SemiAnalysis spent months conducting systematic micro-benchmarking of the Blackwell architecture, publicly revealing for the first time the hardware performance ceiling for AI workloads.

Test results show that Blackwell approaches theoretical peak values in key areas such as Tensor Core throughput, memory subsystem bandwidth, and the new 2SM MMA instructions. Performance depends heavily on instruction shape, however, and some scenarios hit noticeable bandwidth bottlenecks. For AI infrastructure investors and chip buyers, the practical takeaway is that reaching the architecture's full potential hinges on fine-tuning at the software layer.

SemiAnalysis has open-sourced the related benchmark code base; the B200 nodes used for testing were provided by Nebius and Verda. The research team also announced plans to extend future benchmarks to TPU Pallas kernels, Trainium NKI kernels, and AMD CDNA4 assembly.

Core Architectural Changes: TMEM Introduction and 2SM MMA

From Hopper to Blackwell, Nvidia has made several significant changes to the PTX abstraction layer for MMA-related instructions.

The most notable change is the introduction of Tensor Memory (TMEM) for storing MMA accumulators. In previous architectures, threads implicitly held MMA results in their registers; in Blackwell, software explicitly manages TMEM within the MMA scope, which changes the relationship between threads and computation results.

Meanwhile, the tcgen05 operation is now issued by a single thread representing the entire CTA (Cooperative Thread Array), instead of being issued per warp or warpgroup as in Hopper. This is directly reflected in CuTe MMA atoms: Blackwell uses ThrID = Layout<_1>, whereas Hopper uses ThrID = Layout<_128>.

Blackwell has also introduced TMA and MMA at TPC scope. Two cooperating CTAs can execute tcgen05.mma across SMs and share operands, which reduces the shared memory bandwidth each CTA needs and yields MMA instructions with higher arithmetic intensity. Additionally, the architecture natively supports sub-byte data types with micro-scaling and introduces Cluster Launch Control (CLC) as hardware support for dynamic work scheduling in persistent CTA kernels.

Chip Physical Layout: Dual Die Architecture and 300 Cycle Cross-Die Latency

SemiAnalysis used reverse engineering to reveal the physical topology of the B200 chip.

The team read the PTX %smid special register while launching clusters of various sizes to infer the mapping from SMs to GPCs (Graphics Processing Clusters). Results show that some TPCs on B200 exclusively occupy logical GPCs and are never co-scheduled with other TPCs.

By having each SM traverse pointer-chasing arrays that fill the L2 cache and measuring access latency between SMs, the team built an SM distance matrix. The matrix clearly shows two groups of SMs, with average L2 access latency exceeding 300 clock cycles, corresponding exactly to the cross-die access penalty between the two dies.
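The pointer-chasing idea can be illustrated on the host side. The following is a minimal Python sketch, not the team's benchmark code: it builds a chase array with Sattolo's algorithm so that the whole array forms one cycle, which is what forces every load to wait for the previous one. The names `build_chase_array` and `chase` are hypothetical, and the GPU timing itself is not shown.

```python
import random


def build_chase_array(n, seed=0):
    """Return a list nxt where nxt[i] is the slot to visit after slot i.

    Sattolo's algorithm produces a permutation with exactly one cycle,
    so a chase starting anywhere touches all n slots before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, so slot i never maps to itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start=0):
    """Follow the chain once around; each step depends on the previous load."""
    visited = []
    idx = start
    for _ in range(len(nxt)):
        visited.append(idx)
        idx = nxt[idx]
    return visited, idx


if __name__ == "__main__":
    nxt = build_chase_array(16, seed=1)
    order, end = chase(nxt)
    print("visit order:", order)
    print("back at start:", end == 0)
```

In a latency benchmark, the elapsed time of such a traversal divided by the number of dependent loads gives the average latency per access, because the random order defeats prefetching and the dependency chain prevents overlap.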

Accordingly, the team inferred the die-level TPC distribution on B200 as follows:

  • Die A: the four GPCs contain 10, 10, 10, and 9 TPCs respectively
  • Die B: the four GPCs contain 9, 9, 9, and 5+3 TPCs respectively

This physical layout difference means that even two GPUs with identical logical configurations may have different physical SM distributions, becoming a potential source of performance non-determinism.
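As a quick sanity check on the inferred distribution, the per-GPC counts can be summed. This is a small sketch under two assumptions not stated in the article: each TPC contains two SMs, and the "5+3" entry is counted as 8 TPCs.

```python
# Assumption: each TPC pairs two SMs (not stated in the article).
SMS_PER_TPC = 2

# TPCs per GPC as inferred in the article; "5+3" is kept as a sum.
die_a = [10, 10, 10, 9]
die_b = [9, 9, 9, 5 + 3]

tpcs_a = sum(die_a)
tpcs_b = sum(die_b)
total_tpcs = tpcs_a + tpcs_b
total_sms = total_tpcs * SMS_PER_TPC

print(f"Die A: {tpcs_a} TPCs, Die B: {tpcs_b} TPCs")
print(f"Total: {total_tpcs} TPCs, {total_sms} SMs")
```

Under these assumptions the two dies hold 39 and 35 TPCs, for 74 TPCs and 148 SMs in total, so the dies are not symmetric in the number of enabled units.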

Memory Subsystem: Performance Boundaries of LDGSTS and TMA

Memory subsystem testing focuses on two types of asynchronous copy instructions: LDGSTS (asynchronous copy) and TMA (Tensor Memory Accelerator).

On LDGSTS, the tests covered typical configurations of FlashInfer multi-head attention (MHA) kernels. LDGSTS memory throughput saturates at 32 KiB of in-flight bytes, peaking at about 6.6 TB/s. At the same in-flight byte count, 16-byte loads perform slightly better than 8-byte loads and consume fewer execution resources. Baseline LDGSTS latency is about 600 nanoseconds and doubles once in-flight bytes exceed 8 KiB, because large numbers of threads stall on MIO (memory input-output) throttling.
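One way to read the throughput and latency figures together is Little's law: the bytes that must be in flight equal bandwidth times latency. The sketch below is a rough estimate, not a result from the study. It assumes the in-flight figure is counted per SM and that the GPU has 148 SMs; neither is stated in the article, and `NUM_SMS` is an illustrative constant.

```python
PEAK_BANDWIDTH_BYTES_PER_S = 6.6e12  # about 6.6 TB/s, from the article
BASE_LATENCY_S = 600e-9              # about 600 ns, from the article
NUM_SMS = 148                        # assumption, not stated in the article


def bytes_in_flight(bandwidth_bytes_per_s, latency_s):
    """Little's law: concurrency needed = throughput x latency."""
    return bandwidth_bytes_per_s * latency_s


total_bytes = bytes_in_flight(PEAK_BANDWIDTH_BYTES_PER_S, BASE_LATENCY_S)
per_sm_kib = total_bytes / NUM_SMS / 1024

print(f"GPU-wide bytes in flight at peak: {total_bytes / 1e6:.2f} MB")
print(f"Per SM: {per_sm_kib:.1f} KiB")
```

Under these assumptions the estimate lands at roughly 26 KiB per SM, the same order of magnitude as the 32 KiB saturation point. Since latency rises under load, the real requirement would be somewhat higher than this baseline figure.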

On TMA, peak throughput is reached later than with LDGSTS. With in-flight data below 32 bytes, LDGSTS throughput is slightly higher than TMA; above this threshold TMA catches up and keeps scaling up to 128 KiB. LDGSTS latency is slightly lower when in-flight data is below 12 KiB, and above that point TMA latency increases sharply.

TMA multicast testing shows that explicit TMA multicast can perfectly eliminate L2 traffic, achieving the ideal "1/cluster size" L2 byte ratio. Implicit multicast (each CTA individually issues a TMA load to the same data) is comparable to explicit multicast in effective memory throughput, but after in-flight bytes exceed 64, the L2 cache traffic reduction starts to degrade.
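The "1/cluster size" ratio follows from a simple counting argument, sketched here in Python. The model is an idealization: it assumes that without multicast every CTA's load of the shared tile is served separately by L2, and that explicit multicast reads the tile from L2 exactly once. The function names are hypothetical.

```python
def l2_bytes(tile_bytes, cluster_size, multicast):
    """Bytes read from L2 so that every CTA in the cluster gets one tile."""
    if multicast:
        return tile_bytes  # one L2 read, delivered to all CTAs
    return tile_bytes * cluster_size  # each CTA fetches its own copy


def l2_byte_ratio(tile_bytes, cluster_size):
    """L2 traffic with multicast relative to traffic without it."""
    with_mc = l2_bytes(tile_bytes, cluster_size, multicast=True)
    without_mc = l2_bytes(tile_bytes, cluster_size, multicast=False)
    return with_mc / without_mc


if __name__ == "__main__":
    for cluster_size in (1, 2, 4, 8):
        ratio = l2_byte_ratio(16 * 1024, cluster_size)
        print(f"cluster size {cluster_size}: L2 byte ratio {ratio}")
```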

Tensor Core Performance: Significant Shape Dependence, 2SM MMA Achieves Perfect Weak Scaling

Tensor Core testing is the core part of this study, revealing Blackwell MMA performance's high sensitivity to instruction shape.

Throughput: For 1SM MMA, configurations with M=64 reach only 50% of theoretical peak, while M=128 comes close to 100%. This confirms that M=64 uses only half of the datapath. For 2SM MMA, M=128 at N=64 hits 90% of peak and other N sizes are close to 100%; M=256 maintains near-100% peak throughput across all settings because M=256 is equivalent to each SM processing M=128, which fully utilizes the datapath.

AB layout effects are also significant. When both input matrices are stored in shared memory (SS mode), there is a notable SMEM bandwidth bottleneck for M=128 and N<128. For FP16, the hardware can execute 8192 MMA FLOP per cycle and SMEM bandwidth is 128 bytes/cycle. An M=128, N=64, K=16 instruction performs 2 × 128 × 64 × 16 = 262,144 FLOP, which takes 32 cycles of compute, but it reads (128 × 16 + 64 × 16) × 2 = 6,144 bytes of operands from SMEM, which takes 48 cycles. The instruction is therefore limited by SMEM bandwidth. This pattern holds for all data types: MMA instructions with both operands in SMEM are bandwidth-constrained when N<128.
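The cycle counts above can be reproduced with a few lines of Python. This is a minimal sketch using only the two hardware figures quoted in the article (8192 FLOP per cycle and 128 bytes per cycle for FP16); the function name `mma_cycles` is hypothetical, and the model ignores everything except operand reads and MMA issue.

```python
MMA_FLOP_PER_CYCLE = 8192    # FP16 figure quoted in the article
SMEM_BYTES_PER_CYCLE = 128   # SMEM bandwidth quoted in the article
FP16_BYTES = 2


def mma_cycles(m, n, k, elem_bytes=FP16_BYTES):
    """Return (compute_cycles, smem_cycles) for one SS-mode MMA instruction."""
    flop = 2 * m * n * k  # one multiply and one add per MAC
    compute_cycles = flop / MMA_FLOP_PER_CYCLE
    smem_bytes = (m * k + n * k) * elem_bytes  # A tile plus B tile
    smem_cycles = smem_bytes / SMEM_BYTES_PER_CYCLE
    return compute_cycles, smem_cycles


if __name__ == "__main__":
    for n in (64, 128, 256):
        compute, smem = mma_cycles(128, n, 16)
        bound = "SMEM" if smem > compute else "compute"
        print(f"M=128 N={n} K=16: compute {compute:.0f}, "
              f"SMEM {smem:.0f} cycles, {bound}-bound")
```

In this simple model N=128 is the break-even point (64 cycles on both sides), which is consistent with the bottleneck appearing only for N<128.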

2SM MMA achieves perfect weak scaling, doubling speed with double compute resources compared to 1SM MMA. In SS-mode small-shape configurations, since operand B is split across two SMs, more than 2x speedup can occur. Study conclusion: Always use the largest instruction shape available under a given SMEM tile size for maximum throughput.

Latency: For all configurations, latency increases linearly as N goes from 64 to 128, and jumps at N=256. Latency by data type follows a pattern: S8 < BF16 = E4M3 = F4 < MXF8 = MXF4; the team believes higher power efficiency in integer computation explains why S8 is fastest, and scaling factor calculation in micro-scaled data types adds slight overhead.

Tests with realistic numbers of in-flight instructions (1 to 4 MMA instructions, as in typical kernels) show that with 4 concurrent MMAs the throughput ceiling is about 78% to 80% of theoretical peak, and 1SM MMA is about 5 percentage points higher than 2SM MMA.
