How AI inference creates new memory demands

The AI inference era is reshaping demand across the semiconductor memory industry. With the average number of output tokens per question growing more than fivefold per year, the memory requirements created by KV cache management and agentic AI deployment have become one of the most challenging, and most promising, new areas of AI infrastructure.

At the GTC Taipei conference held in June 2026, NVIDIA founder and CEO Jensen Huang stated that "AI's memory system will completely transform the storage system," and named memory systems as one of the most challenging parts of AI infrastructure. His assessment points to two structural demand drivers: first, KV cache offloading generated by inference workloads; second, growth in CPU memory demand driven by the rise of agentic AI.

The effect of these trends on the storage supply chain is already visible. NVIDIA has launched the Dynamo software platform and the CMX Context Memory Storage Platform in succession, while major chipmakers such as Arm, Intel, and AMD have released a wave of next-generation CPUs aimed at agentic AI in 2026. The industry is moving quickly from throughput-oriented architectures to low-latency-oriented ones.

Inference-side Expansion: Explosive Token Growth Reshapes Hardware Demand

The hardware requirements at the AI inference stage differ fundamentally from those during the training stage.

According to public data from NVIDIA, since the second half of 2024, the average number of output tokens per question has surged at a rate of more than 5 times per year, reaching around 30,000 to 40,000 tokens. This trend indicates the industry has entered the "thinking" phase (Test-time Scaling) of NVIDIA's "three scaling laws" on the inference side.

According to TrendForce analysis, AI inference brings three core hardware demands: higher queries per second (QPS), longer context windows, and more inference steps and agentic loops. Together they drive structural changes in memory along three dimensions: model weights, KV cache, and agentic AI.

Model weights are statically allocated memory, with consumption directly tied to the scale of model parameters. The calculation formula is: total model weights size = number of parameters × bytes per parameter. As models scale up, this static allocation forms the foundational layer of memory demand for inference systems.
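The formula above can be worked through with a short sketch. The 70-billion-parameter model and the precision widths below are illustrative assumptions, not figures from NVIDIA or TrendForce; the function name is hypothetical.

```python
def model_weight_bytes(num_parameters: int, bytes_per_parameter: float) -> float:
    """Total model weights size = number of parameters x bytes per parameter."""
    return num_parameters * bytes_per_parameter


# Hypothetical 70-billion-parameter model (an assumption for illustration).
PARAMS = 70_000_000_000

# Common storage widths; the choice of precisions is also an assumption.
PRECISIONS = {"FP16": 2, "FP8": 1, "4-bit": 0.5}

for name, width in PRECISIONS.items():
    gigabytes = model_weight_bytes(PARAMS, width) / 1e9
    print(f"{name}: {gigabytes:.0f} GB")
```

Under these assumptions the weights alone occupy 140 GB at FP16, 70 GB at FP8, and 35 GB at 4-bit precision, before any KV cache is allocated.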

KV Cache: Dynamic Expansion Spawns Offloading Technology and New SSD POD Market

KV cache is the core source of memory pressure during the inference stage.

The KV cache stores the key-value vectors generated during the inference prefill stage so that they do not have to be recomputed during decoding; unlike model weights, it is dynamically allocated memory. Its total size is determined by layer count, KV heads, per-head dimension, sequence length, batch size, and precision. Because sequence length and batch size multiply, the cache swells rapidly as conversations get longer and batches get larger.
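As a rough sketch, a commonly used estimate multiplies the factors listed above, with a leading 2 because each token stores one key and one value vector per layer and KV head. The model configuration below (80 layers, 8 KV heads, head dimension 128, FP16) is a hypothetical example, not a figure from the source.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int) -> int:
    """Estimate KV cache size in bytes.

    The factor 2 accounts for storing both a key and a value vector
    for every token, layer, and KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model configuration (assumption for illustration only).
CONFIG = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

GIB = 1024 ** 3
for batch_size in (1, 16):
    size = kv_cache_bytes(seq_len=32_768, batch_size=batch_size, **CONFIG)
    print(f"batch={batch_size}: {size / GIB:.0f} GiB")
```

With this hypothetical configuration, a single 32,768-token sequence needs 10 GiB of KV cache, and a batch of 16 such sequences needs 160 GiB, which is already more than the model weights of many deployments.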

In inference scenarios with long context and high batch processing, when the GPU's HBM capacity is insufficient, the system is forced to discard KV cache and re-perform prefill computation, leading to increased latency and elevated total cost of ownership (TCO).

To address this bottleneck, NVIDIA released Dynamo, its KV cache offload software, in March 2025. Dynamo moves infrequently accessed KV cache out of HBM to higher-capacity, lower-cost tiers such as CPU memory and SSD, so that the data remains reusable during the decoding stage instead of being recomputed.
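The general idea of tiered offloading can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Dynamo's API or its actual eviction policy: the class name, the least-recently-used demotion rule, and the block capacities are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy tiered cache: tiers are ordered fastest to slowest.

    Assumption: the least recently used block in a full tier is demoted
    to the next tier instead of being discarded.
    """

    def __init__(self, capacities: dict):
        # Dict order defines tier order, e.g. {"HBM": 2, "CPU": 2, "SSD": 2}.
        self.names = list(capacities)
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in self.names}

    def _insert(self, level: int, key, block) -> None:
        if level >= len(self.names):
            return  # fell off the slowest tier; prefill would be recomputed
        name = self.names[level]
        tier = self.tiers[name]
        tier[key] = block
        if len(tier) > self.capacities[name]:
            old_key, old_block = tier.popitem(last=False)  # oldest entry
            self._insert(level + 1, old_key, old_block)    # demote it

    def put(self, key, block) -> None:
        for tier in self.tiers.values():
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, block)

    def get(self, key):
        """Return (block, tier_it_was_found_in), promoting the block on a hit."""
        for name in self.names:
            if key in self.tiers[name]:
                block = self.tiers[name].pop(key)
                self._insert(0, key, block)
                return block, name
        return None, None  # miss

    def location(self, key):
        for name in self.names:
            if key in self.tiers[name]:
                return name
        return None


cache = TieredKVCache({"HBM": 2, "CPU": 2, "SSD": 2})
for session in ["a", "b", "c"]:
    cache.put(session, f"kv-blocks-{session}")
print(cache.location("a"))
```

In this run, inserting a third session into a two-block HBM tier demotes session "a" to CPU memory rather than dropping it, so a later `get("a")` can restore it without repeating prefill.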

To complement Dynamo, NVIDIA launched the CMX Context Memory Storage Platform in January 2026. The platform is built on BlueField-4 STX racks and managed by BlueField-4 DPUs: each rack uses 64 BlueField-4 DPUs to manage about 9,600 TB of capacity, adding a G3.5-level pod context storage tier between local SSD (G3 tier) and shared storage (G4 tier).

Notably, the BlueField-4 DPU structure model showcased at COMPUTEX 2026 was already fitted with SK Hynix's PEB210 E1.S and PE9010 M.2 SSD samples. As NVIDIA, Google, and other vendors launch SSD POD platforms, demand in this niche market is expected to keep rising.

Agentic AI: CPU & GPU Ratio Restructured to 1:1, LPDRAM Demand Expands

The large-scale deployment of agentic AI is triggering another deep change in AI server architecture.

In agentic AI workflows, models must actively carry out planning, tool calls, decision-making and agent operations. All orchestration, data routing, and sub-agent evaluation tasks are handled by the CPU. Jensen Huang pointed out that agents live in a nanosecond world, where ultra-low latency is the primary requirement; this greatly elevates the importance of CPU architecture.

TrendForce predicts that as agentic AI deployment scales up, the CPU-to-GPU workload ratio will shift from the traditional 1:4 or 1:8 to about 1:1, opening substantial room for growth in the CPU market and driving corresponding structural growth in CPU memory demand.
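To get a feel for the scale of this shift, the ratios can be applied to a fixed GPU count. This sketch assumes the ratio is read as a device count (CPUs per GPUs), which is a simplification of the "workload ratio" TrendForce describes, and the 72-GPU system is a hypothetical example.

```python
import math


def cpus_needed(num_gpus: int, gpus_per_cpu: int) -> int:
    """CPUs required for a given GPU count at a 1:gpus_per_cpu CPU-to-GPU ratio."""
    return math.ceil(num_gpus / gpus_per_cpu)


# Hypothetical system size (assumption for illustration only).
NUM_GPUS = 72

for gpus_per_cpu in (8, 4, 1):
    count = cpus_needed(NUM_GPUS, gpus_per_cpu)
    print(f"1:{gpus_per_cpu} ratio -> {count} CPUs for {NUM_GPUS} GPUs")
```

Under that reading, the same 72 GPUs go from needing 9 or 18 CPUs to needing 72, and every additional CPU brings its own attached memory.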

In 2026, NVIDIA launched the Vera CPU, designed for agentic AI workloads. According to initial specs, Vera supports up to 1.5 TB of LPDDR5X memory, three times that of the previous Grace CPU.

However, TrendForce's latest survey shows NVIDIA has decided to halve the SOCAMM memory capacity of the next-generation Vera Rubin superchip module, due to insufficient LPDRAM capacity allocation from suppliers in the preliminary 2027 production plan. This adjustment does not reflect any overall decrease in NVIDIA's memory demand.

In the broader CPU market, 2026 is shaping up as a year of across-the-board product upgrades for agentic AI. Intel has launched Xeon 6+ (Clearwater Forest), AMD has released EPYC Venice, Arm has launched the Arm AGI CPU, and Ampere’s AmpereOne MX is expected to enter mass production within the year. Competition on this many fronts will further accelerate growth in CPU memory demand.

Dual Drivers Resonate: Storage Industry Chain Welcomes Structural Opportunities

Overall, AI inference is reshaping the memory demand landscape from two independent yet synergistic dimensions.

First, inference workloads are driving rapid growth in KV cache consumption. KV cache offloading is redirecting vast amounts of data to CPU memory and SSD PODs; as the related platforms roll out faster, demand in this niche market is becoming increasingly visible.

Second, agentic AI is shifting the CPU-to-GPU workload ratio toward 1:1, creating new market room for CPUs and the LPDRAM paired with them.

For investors in the storage supply chain, these trends mean that, beyond HBM, enterprise SSDs, LPDRAM, and DPU-managed storage products are becoming new focal points of AI infrastructure investment.
