Fatal Risks in the AI Era: CME Data Center Outage Reveals Cooling System Concerns

```

A trading interruption at the Chicago Mercantile Exchange (CME) has brought data center cooling issues into the public spotlight.

On November 27, the trading platform of CME Group, the world’s largest futures exchange operator, suffered hours-long outages that affected trillions of dollars’ worth of contracts across stocks, forex, bonds, and commodities.

The direct cause of the incident was a malfunction in the cooling system at its data center in Aurora, Illinois. The data center is operated by CyrusOne, which is owned by investment firms KKR & Co. and Global Infrastructure Partners.

CyrusOne stated that a chiller unit in its facility failed, affecting several cooling units. This “simple” physical malfunction triggered turmoil in global markets. Cooling is also a major cost item: to keep equipment from overheating, operators may spend up to 15% of a data center project’s total investment on cooling systems.

This event is not just an isolated technical glitch. Against the backdrop of the AI boom that briefly made Nvidia the world’s most valuable company, the issue of data center heat dissipation has become increasingly prominent.

Where Does the Heat Come From?

Data centers are buildings filled with servers, which are made up of stacks of chips working together to process and store data.

Processing capability, often called “compute,” has become a key commodity that AI companies need to train models.

Data centers profit by renting out compute to other companies, which means operators are incentivized to fit as many servers as possible into the same space to maximize capacity.

All these servers require large amounts of electricity.

Because servers draw so much power and must run around the clock, a data center’s energy use per square foot can be up to 50 times that of a typical office building.

Much of the energy they consume ultimately dissipates as waste heat. This is similar to how a personal laptop or smartphone heats up while handling complex tasks.

Cooling Technologies and Trade-offs

Traditionally, servers are cooled with cold air, using a principle similar to household air conditioning.

Fans blow cold air toward servers and expel hot air from the server rooms. However, as AI data centers generate more heat, liquid cooling systems have become more common since about 2022.

There are various liquid cooling methods, such as channeling cold liquids through pipes attached to heat sinks next to chips, or immersing entire servers in containers filled with coolant.

Some systems use low-boiling-point liquids which absorb heat and evaporate upon contact with high-temperature chips, later condensing back into liquid to be recirculated.

Liquids can carry more heat per unit volume than air, which makes them more efficient coolants. However, these systems are expensive and complex to install, and they can be tricky to manage when problems arise: nobody wants costly chips soaked in liquid.

Whether using air or liquid, after heat is transferred from chips, it eventually passes into a circulating water system and is released to the outside environment via cooling towers or industrial chillers.

This is why data centers consume large amounts of water and have raised concerns about exacerbating water stress in drought-prone areas.

The Cost of Overheating

Overheating at data centers can lead to data loss, damage costly chips inside servers, and cause service disruption for clients.

The consequences resemble those of recent service interruptions caused by technical failures at several digital infrastructure providers.

For example, cybersecurity firm Cloudflare Inc. suffered a major network outage last November, rendering sites from social platform X to ChatGPT inaccessible. Amazon Web Services, CrowdStrike, and Microsoft have also experienced similar problems.

Typically, data centers invest heavily in redundancy, such as backup generators, extra cooling units, or even replicating entire facilities, to minimize the risk of outages.

But as systems grow more complex, interruptions can remain difficult to avoid despite redundancy measures.

Review of the CME Incident

CME’s trading platform is housed on a campus in Aurora, a suburb of Chicago. The campus belongs to data center operator CyrusOne.

According to CyrusOne, on November 27 a chiller unit at its Aurora facility malfunctioned, affecting several cooling units, and ultimately causing the trading outage.

After the incident, CyrusOne said it had deployed temporary cooling equipment to supplement permanent systems while working to restore full cooling capacity.

According to information on the company’s website, its Aurora campus features “advanced cooling technology,” using air-cooled chillers and leveraging natural cold air or water for cooling when temperatures fall below 30°F (about -1°C).

Weather forecast data shows that at 10:40 a.m. on November 28, Aurora’s local temperature was about 28°F.

It is worth noting that CyrusOne’s website also claims its Aurora facility has additional cooling units to handle chiller failures.

It is unclear whether the redundancy system functioned as intended in this incident.

```
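Two figures in the article lend themselves to a quick back-of-the-envelope check: the claim that liquids carry more heat per unit volume than air, and the 30°F free-cooling threshold quoted for the Aurora campus. The sketch below is illustrative only. The fluid properties are approximate room-temperature textbook values for water and air, not figures from the article, and real coolants differ. The function names are hypothetical, and the threshold check only compares two temperatures; it says nothing about how CyrusOne's systems actually behaved.

```python
# Approximate room-temperature properties (assumed textbook values, not from the article).
WATER_DENSITY_KG_M3 = 997.0
WATER_SPECIFIC_HEAT_J_KG_K = 4186.0
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0

# Threshold quoted from the operator's website in the article.
FREE_COOLING_THRESHOLD_F = 30.0


def volumetric_heat_capacity(density_kg_m3: float, specific_heat_j_kg_k: float) -> float:
    """Heat absorbed by one cubic metre of fluid per kelvin of warming, in J/(m^3*K)."""
    return density_kg_m3 * specific_heat_j_kg_k


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def below_free_cooling_threshold(outdoor_f: float,
                                 threshold_f: float = FREE_COOLING_THRESHOLD_F) -> bool:
    """Return True when the outdoor temperature is under the quoted threshold."""
    return outdoor_f < threshold_f


water = volumetric_heat_capacity(WATER_DENSITY_KG_M3, WATER_SPECIFIC_HEAT_J_KG_K)
air = volumetric_heat_capacity(AIR_DENSITY_KG_M3, AIR_SPECIFIC_HEAT_J_KG_K)
ratio = water / air

if __name__ == "__main__":
    print(f"Water: {water:,.0f} J/(m^3*K)")
    print(f"Air:   {air:,.0f} J/(m^3*K)")
    print(f"Water carries roughly {ratio:,.0f} times more heat per unit volume")
    print(f"30°F is {fahrenheit_to_celsius(30.0):.1f}°C")
    print(f"28°F below threshold: {below_free_cooling_threshold(28.0)}")
```

Under these assumptions, water holds on the order of a few thousand times more heat per unit volume than air, which is the physical basis for the efficiency claim. The reported 28°F reading sits just under the 30°F threshold cited on the operator's website.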