SemiAnalysis: Downstream large model companies are already making huge profits; Nvidia and TSMC can earn even more

```

The AI value chain is undergoing a structural revaluation. Previously, most profits went to chip manufacturers, but now downstream model firms are rapidly catching up, while upstream profit margins are far from hitting a ceiling.

SemiAnalysis points out that Anthropic's annualized revenue soared from $9 billion to over $44 billion within months, with inference gross margin rising from 38% to over 70%. Nvidia’s current pricing framework is still cost-oriented and hasn’t reflected changes in inference workload economics. Once the framework is adjusted, Nvidia system pricing has more than 40% upside potential. TSMC’s N3 process capacity is also at the core of value redistribution.

The key to this judgment lies in a structural mismatch between supply and demand: N3 process utilization is expected to exceed 100% in the second half of 2026, DRAM fabs are already running at over 90% utilization, and token demand from frontier models is still compounding. Against this backdrop, Nvidia has opened a window for differentiated pricing through SOCAMM memory modules.

AI Value Shifts: Infrastructure Layer Gives Way to Model Layer

From 2023 to early 2025, most of the profits in the AI value chain accumulated in the infrastructure layer. Nvidia surged first; power names Vistra and GE Vernova then rose 265% and 146% respectively in 2024, and memory and storage vendors Sandisk, Western Digital, Seagate, and Micron each gained more than 200% in 2025.

Under this structure, model developers and inference service providers were long stuck with low gross margins. At that time, AI's real application value was limited, and the market continuously questioned the return on AI investment.

The turning point came in December 2025. As agentic AI became genuinely practical, the economics of AI were rewritten. SemiAnalysis disclosed that its own annual token spending was close to 30% of employee payroll, and that each employee consumed nearly 5 billion tokens per month, more than 5 times Meta's per-capita usage. Many tasks that used to take junior analysts hours (financial modeling, data visualization, profitability analysis) now cost only a few dollars in tokens.

SemiAnalysis estimated their team’s annualized spend on Anthropic Claude reached $10.95 million, and the competitive advantage gained far outweighed the cost. Anthropic benefited immediately: ARR skyrocketed from $9 billion to over $44 billion, and inference gross margin rose from 38% to over 70%.

Token Cost Plummets, Model Vendor Profit Margins Expand Sustainably

The other core factor driving margin jumps for model firms is the significant drop in token production costs.

On the hardware side, for a standard inference task with 8K input and 1K output tokens, a fully software-optimized B300 system (wide expert parallelism, prefill/decode disaggregation, and multi-token prediction) produces about 14,000 tokens per second per GPU, versus only about 1,000 for the unoptimized version: software optimization alone delivers a 14x throughput gain on the same hardware. Adding hardware upgrades, the best-configured GB300 NVL72 delivers about 17x the FP8 throughput of H100; switching to FP4 precision, which H100 does not support, widens the gap to 32x, while GB300's total cost per GPU is only about 70% higher.

From the pricing side, Agentic workloads have extremely high input-output ratios (Claude Code scenarios are about 300:1) and extremely high cache hit rates (above 90%), meaning most tokens fall into the lowest pricing tier. SemiAnalysis estimates that Opus 4.7’s actual blended cost for agent tasks is $0.99 per million tokens, far lower than the list price of $5 per million input tokens.

Even after Anthropic's steep price cuts for the Opus series (Opus 4.5 was priced two-thirds below its predecessor), SemiAnalysis believes Anthropic's per-unit gross margin actually rose: production costs fell further with hardware upgrades, and mass migration from Sonnet to Opus pushed up the blended ASP.

More strategically, Anthropic still controls pricing for its high-end product line. Opus Fast is priced at 6 times regular Opus, and the announced Mythos is priced at $25/$125 per million tokens, 5 times regular Opus. SemiAnalysis states explicitly that even if Anthropic offered Mythos Fast at $150/$750 per million tokens, its team would still buy, because the productivity gain far exceeds the cost.

Why Competition Is Unlikely to Erode Model Vendors' Pricing Power

The most common objection to the sustainability of high margins for frontier models is competitive pressure. SemiAnalysis rebuts this on two grounds.

First, the gap in capability between frontier closed-source models and open-source models remains significant and hard to close in the short term. Low-priced open-source models like Kimi K2.6 ($0.95/$4 per million tokens) do not materially pressure Opus pricing.

Second, compute constraints mean no single frontier lab can serve the entire market alone. Anthropic actively manages demand by locking Claude Code behind a $100+ monthly subscription and restricting third-party access. Token demand will continue to exceed supply for the foreseeable future, so labs that can deliver genuinely frontier quality can price on the economic value of their tokens rather than on competitors' costs.

Nvidia's Restrained Pricing: Regulatory Logic or Strategic Misjudgment?

In the face of profound restructuring of the AI value chain, Nvidia has yet to make substantive adjustments to its pricing framework, which is a structural issue worth noting.

Nvidia's current pricing is still anchored mainly on cost, reflecting an older paradigm in which the value of demand declines over time, an assumption that no longer holds. Current demand growth is compound rather than linear, driven by the explosion of agent workloads and persistent jumps in token consumption per workflow.

SemiAnalysis believes Nvidia keeps pricing restrained partly out of regulatory concerns. Nvidia's dominance in GPUs, interconnect, and software stacks has drawn mounting antitrust attention. With downstream AI labs also highly profitable, aggressive price hikes could escalate regulatory risk or accelerate customer migration to TPUs, Trainium, and other alternatives.

In this sense, Nvidia's behavior parallels TSMC's. Even while running at full capacity as the bottleneck in advanced-node supply, TSMC has not raised prices to the full scarcity premium, instead prioritizing ecosystem stability and customer relationships. This logic can be summarized as acting as an "AI Central Bank": supporting downstream ecosystem expansion through moderate profit-sharing rather than maximizing short-term profit extraction, in order to secure long-term dominance in the AI era.

However, this strategy has real opportunity costs. When compute demand structurally outstrips supply, holding scarce resources without fully pricing them is tantamount to handing value to the midstream and downstream parts of the chain. TSMC has done the same on N3. SemiAnalysis explicitly calls this a "strategic misjudgment" and argues that TSMC should at least have required larger upfront payment arrangements.

Rubin Pricing Space: SOCAMM Becomes New Profit Lever

Nvidia's upcoming Vera Rubin VR NVL72 system offers a chance to re-examine its pricing framework.

From the cost perspective, calculations show that for VR NVL72 to achieve the same 15.6% project IRR (5-year term, 15% upfront payment) as GB300 NVL72, the minimum rental price needs to be about $4.92 per GPU per hour. From the value perspective, anchoring on FP8 compute, GB300 currently rents for about $0.70 per PFLOP, which puts VR NVL72's theoretical maximum price at $12.25 per GPU per hour, about 2.5 times the cost floor.

This large spread indicates Nvidia has ample room to raise VR NVL72 pricing. SemiAnalysis estimates that even if Nvidia raises system pricing by about 40%, neoclouds would retain enough profit margin, and even if neoclouds raise rental prices above $8/hour, the per-PFLOP cost would still be below historical trendlines.

In terms of mechanism, SOCAMM is the key pricing lever. Unlike GB300, which solders LPDDR5X memory onto the board and embeds it in overall system pricing, VR NVL72 uses pluggable SOCAMM modules, allowing Nvidia to separately itemize and price memory.

SOCAMM (Small Outline Compression Attached Memory Module) is a new modular memory standard led by Nvidia and jointly developed with Samsung, SK Hynix, Micron, and other memory vendors. It is based on LPDDR5X DRAM (with LPDDR6 to follow) and targets AI servers and personal AI supercomputers.

SemiAnalysis's model shows Nvidia's Q1 2026 SOCAMM contract price at about $8/GB, a sharp rise from the previous quarter that reflects tightening LPDDR5X supply and the broader escalation in DRAM prices. Based on forecasts for mobile DRAM pricing at the end of 2026, the SOCAMM price may exceed $13/GB, with an annual average of $10/GB a reasonable estimate.

On this basis, SemiAnalysis considers a 60% gross margin on SOCAMM reasonable for Nvidia, for three reasons. First, memory supply is tight across the board, and Nvidia has priority access in SOCAMM procurement. Second, VR NVL72 far exceeds its contemporaries in performance per TCO, so customers lack alternatives. Third, Nvidia itself faces sharply rising SOCAMM procurement costs, so there is a rationale for passing them downstream.

Additionally, memory pricing does not face the antitrust concerns that GPU pricing does, giving Nvidia more room for differentiated pricing, including different prices for neoclouds and hyperscale cloud vendors. Nvidia currently charges neoclouds about twice as much as hyperscalers for network hardware, a practice that could be extended to memory pricing.

Risk Disclaimer and Terms

The market carries risk; invest with caution. This article does not constitute personal investment advice and does not consider individual users' investment objectives, financial status, or needs. Users should assess whether any opinions, viewpoints, or conclusions in this article fit their specific circumstances. Investing on this basis is at your own risk.
```
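
The blended-cost figure for agent workloads in the "Token Cost Plummets" section can be roughly reproduced with a short calculation. This is an illustrative sketch, not SemiAnalysis's model. The $5 input list price, the 300:1 input-output ratio, and the 90%-plus cache hit rate come from the article. The $25 output price is inferred from the statement that Mythos at $25/$125 is 5 times regular Opus. The cached-input price (10% of list) and the exact 91% hit rate are assumptions chosen for illustration.

```python
def blended_price_per_million(input_price, output_price, cached_input_price,
                              input_output_ratio, cache_hit_rate):
    """Average price per million tokens across cached input, fresh input, and output."""
    # Cache hits pay the cached rate; cache misses pay the list input price.
    effective_input = (cache_hit_rate * cached_input_price
                       + (1.0 - cache_hit_rate) * input_price)
    # Weight input and output prices by their share of total tokens.
    total_parts = input_output_ratio + 1.0
    return (input_output_ratio * effective_input + output_price) / total_parts


INPUT_PRICE = 5.00                         # $/M input tokens (from the article)
OUTPUT_PRICE = 25.00                       # $/M output tokens (inferred, see above)
CACHED_INPUT_PRICE = 0.10 * INPUT_PRICE    # assumption: cached input at 10% of list
INPUT_OUTPUT_RATIO = 300                   # Claude Code ratio cited in the article
CACHE_HIT_RATE = 0.91                      # assumption: "above 90%" taken as 91%

blended = blended_price_per_million(INPUT_PRICE, OUTPUT_PRICE, CACHED_INPUT_PRICE,
                                    INPUT_OUTPUT_RATIO, CACHE_HIT_RATE)
print(f"Blended price: ${blended:.2f} per million tokens")
```

Under these assumptions the result lands close to the $0.99 per million tokens cited in the article, which shows how a high cache hit rate and an input-heavy mix pull the effective price far below list.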
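
The hardware figures in the same section imply a drop in cost per token that the article does not spell out. The sketch below derives it under a simplifying assumption that is mine, not SemiAnalysis's: cost per token scales as total cost per GPU divided by throughput per GPU, and the "about 70%" cost increase is measured against H100. The 17x, 32x, and 70% inputs are from the article.

```python
def relative_cost_per_token(throughput_multiple, cost_multiple):
    """Cost per token relative to the baseline GPU (baseline = 1.0)."""
    # Assumption: cost per token = cost per GPU / tokens produced per GPU.
    return cost_multiple / throughput_multiple


COST_MULTIPLE = 1.70  # GB300 total cost per GPU vs. H100 (about 70% higher)

for label, throughput_multiple in (("FP8", 17.0), ("FP4", 32.0)):
    relative = relative_cost_per_token(throughput_multiple, COST_MULTIPLE)
    print(f"{label}: cost per token is {relative:.1%} of the H100 level, "
          f"about {1.0 / relative:.1f}x cheaper")
```

On these inputs, cost per token falls to roughly a tenth of the H100 level at FP8 and to roughly a nineteenth at FP4, before any software optimization gains are counted.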
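
The pricing room described in the "Rubin Pricing Space" section can be checked with the article's own numbers. This sketch assumes the $12.25 ceiling was obtained by multiplying GB300's $0.70 per-PFLOP rental by VR NVL72's FP8 PFLOPs per GPU; the per-GPU PFLOP figure it backs out is therefore implied, not stated in the source. It also compares against the current GB300 per-PFLOP price rather than the historical trendline, which the article does not quantify.

```python
COST_FLOOR = 4.92        # $/GPU-hour needed to match GB300's 15.6% IRR (article)
VALUE_CEILING = 12.25    # $/GPU-hour theoretical maximum (article)
GB300_PER_PFLOP = 0.70   # $/PFLOP current GB300 rental (article)

# Implied FP8 PFLOPs per VR NVL72 GPU under the assumption described above.
implied_pflops_per_gpu = VALUE_CEILING / GB300_PER_PFLOP


def price_per_pflop(hourly_price, pflops_per_gpu):
    """Rental price per PFLOP for a given per-GPU hourly price."""
    return hourly_price / pflops_per_gpu


spread = VALUE_CEILING / COST_FLOOR
at_eight_dollars = price_per_pflop(8.00, implied_pflops_per_gpu)

print(f"Implied FP8 PFLOPs per GPU: {implied_pflops_per_gpu:.1f}")
print(f"Ceiling / floor: {spread:.2f}x")
print(f"Per-PFLOP price at $8.00/hour: ${at_eight_dollars:.2f}")
```

Under this reading, an $8/hour rental still works out to well under GB300's $0.70 per PFLOP, which is consistent with the article's claim that neoclouds could absorb a system price increase.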