Microsoft's first "AI super factory" begins operation: connecting two data centers to build a distributed network

```

Microsoft is opening a new chapter in its AI infrastructure, building a distributed “AI super factory” by connecting large data centers across different states. The strategy aims to accelerate AI model training at unprecedented scale and speed. It also marks a shift in how the industry competes to meet explosive demand for computing power: from building ever-larger single sites to building networks of sites.

According to Microsoft, its next-generation AI data center in Atlanta officially began operations in October this year. It is the second facility in Microsoft’s “Fairwater” series and is already connected via a dedicated high-speed network to the first, a previously announced data center in Wisconsin. Microsoft’s first cross-state AI computing cluster is therefore now operational, and the company says it can cut complex AI training jobs that would have taken several months down to a few weeks.

This move comes amid the intensifying "AI arms race" among tech giants. According to The Wall Street Journal, Microsoft plans to double its total data center area in the next two years to meet surging computing power demands. The new “AI super factory” network will support not only OpenAI, Microsoft’s own AI superintelligence teams, and Copilot, but also key clients such as France’s Mistral AI and Elon Musk’s xAI, highlighting its core position in the AI infrastructure field.

Behind this massive construction plan lies enormous capital expenditure. In the most recent fiscal quarter, Microsoft’s capital expenditure exceeded $34 billion, and it expects to further increase investments over the coming year. Across the industry, total AI-related investment is projected to reach $400 billion this year. Against this backdrop, Microsoft’s distributed network strategy represents not only technological innovation, but also a key step in cementing its market leadership amid fierce competition.

“AI Super Factory”: From Independent Sites to Distributed Networks

The core of Microsoft’s “AI super factory” concept is to integrate multiple geographically dispersed data centers into a single virtual supercomputer, a fundamentally different approach from traditional data center architecture.

Microsoft Azure Infrastructure General Manager Alistair Speirs explained: “Traditional data centers are designed to run millions of independent applications for multiple customers, but we call this an ‘AI super factory’ because it runs one complex job across millions of pieces of hardware.” In this model, AI model training is no longer confined to a single site, but is jointly supported by a network of sites.

This distributed network connects multiple sites and combines hundreds of thousands of state-of-the-art GPUs, exabyte-level storage, and millions of CPU cores. Its design goal is to support future AI model training with trillions of parameters. As AI training workflows become more complex, covering pre-training, fine-tuning, reinforcement learning, and evaluation stages, this cross-site collaborative capability becomes critical.

Purpose-Built for AI: Next-Generation Data Center Design and Technology

To realize the super factory vision, Microsoft designed the “Fairwater” series of data centers from scratch. The Atlanta facility covers 85 acres, with over 1 million square feet of building area, and its design is fully optimized for AI workloads.

Key technological features include:

High-density architecture: An innovative two-story building design fits more GPUs into less physical space, reducing internal communication latency.

Cutting-edge chip systems: Deploys Nvidia’s GB200 NVL72 rack-scale system, scalable to hundreds of thousands of Nvidia Blackwell architecture GPUs.

Efficient liquid cooling: To handle the massive heat generated by GPU clusters, Microsoft designed a sophisticated closed-loop liquid cooling system. The system consumes almost no water in operation; its initial fill is equivalent to the amount of water 20 American households use in a year.

Internal high-speed interconnect: Within the data center, a high-speed network tightly connects all GPUs, ensuring rapid data flow between chips.

“To be a leader in AI is not just about adding more GPUs, but building the infrastructure for them to work as a unified system,” said Scott Guthrie, Microsoft’s Executive VP of Cloud and AI. He emphasized that Fairwater’s design embodies years of end-to-end engineering experience at Microsoft and is aimed at meeting rapidly growing demands with real-world performance.

Connecting Multiple States: AI WAN and Compute Allocation Strategies

Linking far-flung data centers into a unified whole relies on Microsoft’s custom-built AI Wide Area Network (AI WAN). Microsoft deployed 120,000 miles of dedicated fiber-optic cable, creating a “highway” reserved for AI traffic so that data can travel at close to the speed of light without congestion.

Microsoft Azure CTO Mark Russinovich points out that as model sizes grow, the compute needed for training far exceeds the capacity of any single data center. If any part of the network encounters a bottleneck, the entire training job stalls. The aim of the Fairwater network is to keep all GPUs busy all the time.

Microsoft builds across states, rather than concentrating all its computing power in one place, primarily because of constraints on land and electricity supply. Alistair Speirs told The Wall Street Journal that dispersing power demand across different regions avoids overburdening any single grid or community. He said, “You have to be able to train across multiple regions, because no one has ever reached our scale, so no one has really encountered this problem before.”

“Arms Race” Amid Surging Demand

Microsoft’s “AI super factory” is its core asset for coping with soaring compute demand and competing with rivals. Although Microsoft previously adjusted some of its data center leasing plans, Alistair Speirs clarified that this was simply a shift in capacity planning, and current demand far exceeds what the company can supply.

Microsoft is not alone in this compute race. Its main competitor, Amazon, recently launched the Project Rainier data center cluster in Indiana, which covers 1,200 acres and is expected to consume 2.2 gigawatts of power. Meta Platforms, Oracle, and others have also announced large-scale construction plans, while AI startup Anthropic has announced plans to invest $50 billion in computing infrastructure in the US.

By linking data centers into a unified distributed system, Microsoft is forging a new technological path and is commercially prepared to meet the massive demands of top AI companies. As Scott Guthrie said: “We get AI sites operating as a whole, enabling our customers to turn breakthrough models into reality.”

```