Under the Hood: How Two-Phase, Direct-to-Chip Liquid Cooling Works

By Shahar Belkin, Chief Evangelist at ZutaCore.

As next-generation AI superchips approach 2,800 watts and beyond, the industry is going to see the widespread emergence of 50-100 megawatt AI factories that are powerful enough to handle AI workloads, but hotter than ever. Liquid cooling, or a combination of liquid cooling and air cooling, will be required to keep these facilities cool, and data centers and hyperscalers have several types of liquid cooling to choose from.

As the graphic below illustrates, there are two main categories of liquid cooling. The most commonly used method is direct-to-chip cooling, which uses small, compact cold plates filled with a liquid: water, a water-glycol mix, or a heat transfer fluid. These cold plates are placed directly on top of the GPU or CPU, in place of a traditional heat sink.

The other category of liquid cooling is immersion cooling. This method uses large, heavy tanks filled with either an oily liquid or a low-boiling-point dielectric fluid. The servers and other IT equipment are literally ‘immersed’ in this liquid, hence the name.

 

Caption:  The two main categories of liquid cooling are ‘immersion’ and ‘direct-to-chip’, and each has a single-phase or two-phase option.

It is widely agreed in the industry that direct-to-chip liquid cooling is the most effective technology for scaling in the future and for handling the added heat from AI factories. Once a data center decides to go direct-to-chip, the choice between a single-phase and a two-phase approach is an easy one. Single-phase uses water or a water-glycol mix, which presents a risk of catastrophic damage in the event of a leak, and hyperscalers do not want that risk. Two-phase, direct-to-chip cooling, on the other hand, uses no water in the cold plate and relies on a heat transfer fluid that is safe for IT equipment.

How Direct-to-Chip Liquid Cooling Works

As you can see in the image below, two-phase, direct-to-chip liquid cooling consists of a simple cold plate design that sits on top of the CPU or GPU. Using a cold plate means that you don’t have to change the server and rack design. It only involves replacing the air-based heat sink with a cold plate.

 

Caption:  Two-phase, direct-to-chip cooling uses no water in the cold plate

Inside the cold plate is a pool of heat transfer fluid. When the chip generates heat, the liquid begins to boil and turns into vapor, which carries the heat away. The liquid always remains at a consistent boiling temperature, regardless of chip power, which ensures predictable thermal performance and allows this cooling method to scale to hotter and hotter chips as they become available.

As can be seen in the image below, this process is similar to the way boiling water keeps the bottom of a pot at 100 °C, only in this case at a lower temperature. However much heat the chip produces, the liquid inside the cold plate never exceeds its boiling temperature. This makes the technique highly scalable for cooling the higher-power, “hotter” chips of the future.

Caption:  It does not matter if you turn the heat up 3X on a ZutaCore cold plate because the liquid will always stay at boiling temperature, requiring no new equipment or infrastructure change
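The pot analogy can be put in rough numbers. The sketch below assumes that all chip heat goes into boiling (the liquid arrives already at its boiling point), so the vapor generation rate is simply chip power divided by the fluid's latent heat of vaporization. The function name and both fluid values are hypothetical placeholders chosen for illustration; they are not properties of any specific ZutaCore fluid.

```python
# Illustrative sketch only: a steady-state energy balance for pool boiling.
# ASSUMED values below are placeholders, not data for any real heat transfer fluid.

ASSUMED_LATENT_HEAT_J_PER_KG = 100_000.0  # hypothetical latent heat of vaporization
ASSUMED_BOILING_POINT_C = 50.0            # hypothetical boiling temperature


def vapor_mass_flow_kg_per_s(chip_power_w: float, latent_heat_j_per_kg: float) -> float:
    """Vapor generated per second if all chip heat is absorbed by boiling."""
    if chip_power_w < 0:
        raise ValueError("chip power must be non-negative")
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    return chip_power_w / latent_heat_j_per_kg


if __name__ == "__main__":
    for power_w in (1000.0, 2000.0, 3000.0):
        flow = vapor_mass_flow_kg_per_s(power_w, ASSUMED_LATENT_HEAT_J_PER_KG)
        # The liquid temperature is the same in every row; only the vapor rate changes.
        print(f"{power_w:6.0f} W -> liquid at {ASSUMED_BOILING_POINT_C:.0f} C, "
              f"vapor rate {flow * 1000:.1f} g/s")
```

Under these assumptions, tripling the chip power triples the amount of vapor produced while the liquid temperature stays put, which is the behavior the caption describes.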

To see an instructional video showing how the two-phase pool boiling approach works graphically, click this link.

So What Happens to the Bubbles?

While pool boiling has always been the holy grail of liquid cooling, until now no one had figured out how to prevent the boiling bubbles from causing hot spots. To overcome this, ZutaCore developed a structure of fins with a wick, a material that is porous like a sponge, located between the fins (see image below). The liquid soaks into the wick, and the bubbles form between the wick, the liquid and the fins. This prevents bubbles from forming on the surface and maintains uniform cooling.

 

Caption: Hot spots are eliminated through novel use of wicks and fins

Looking at the Whole System: Cold Plate, Manifolds, Heat Rejection Units

While two-phase, direct-to-chip cold plates sit on top of the processors, the system also includes manifolds that distribute the liquid and collect the vapor in a closed loop, sending the heat to a heat rejection unit. Together, these components deliver one of the most cost-effective, sustainable cooling mechanisms, one that is also easy to install and requires little to no maintenance over time.

 

Repurposes Heat into Valuable, Reusable Energy.  

Another advantage of two-phase, direct-to-chip liquid cooling is that, because all the heat is now carried as vapor in a tube, it can be transferred and reused for other purposes, such as heating the data center or even nearby buildings and neighborhoods. This delivers a level of sustainability that is simply not possible with alternative cooling solutions.

 

Caption:  Two-phase, direct-to-chip cooling allows heat to be reused for other purposes

A New Data Center Architecture for AI Factories

Munters, a global leader in energy-efficient air treatment and climate solutions, recently announced a partnership with ZutaCore to deliver a new waterless data center architecture capable of sustainably cooling the massive power densities being driven by AI accelerators.

This solution is designed to liquid cool hundreds of megawatts of AI workloads. Munters has integrated the HyperCool closed-loop system at the server and rack level with the Munters closed-loop system, which removes heat from the data center without a facility water loop. Using a two-phase liquid cooling process, this system features condensers on the roof that condense the vapor back to liquid using dry coolers and ambient air. Gravity then brings the liquid down on demand to cool the GPUs and CPUs. The heat is rejected outside the facility or can be reused for other applications, such as heating adjacent facilities or harvesting energy for reuse, further reducing PUE.

To see a video that walks through this new architecture, click here.  

 

Caption:  Heat is transported from ZutaCore pool-boiling cold plates directly to Munters SyCool heat rejection condensers, eliminating the need for intermediate heat exchange devices. 

How PUE Impacts Your Choice of Liquid Cooling

Power usage effectiveness (PUE) is a widely used industry metric for how efficiently data centers use their energy. It is the ratio of the total energy a facility consumes to the energy used by the IT equipment alone.

The industry as a whole has been stuck at a PUE of around 1.5, which means that one-third of data center power is being used for cooling, lighting, and other systems. If an AI factory is drawing 15 megawatts, only 10 megawatts is going to compute and 5 megawatts is going to overhead, including the cooling for that compute. This number is only going to get more important with the transition to 100 megawatt AI factories, which is why European countries are starting to institute new regulations requiring data centers to measure and report their PUEs in an effort to reduce their environmental impact. In fact, all European data centers larger than 500 kW will soon be required to report factors such as floor area, installed power, data volumes, energy consumption, PUE, temperature set points, waste heat utilization, water usage, and use of renewable energy. This information will then be used to assess key elements of a sustainable data center and to provide a basis for transparent, evidence-based planning and decision making by member states and the Commission.
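The 15 megawatt example above follows directly from the PUE definition. As a minimal sketch (the function names are illustrative, not from any standard tool), the arithmetic looks like this:

```python
# Minimal sketch of the PUE arithmetic; function names are illustrative.

def pue(total_facility_power_mw: float, it_power_mw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    if it_power_mw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_power_mw / it_power_mw


def it_power_from_total(total_facility_power_mw: float, pue_value: float) -> float:
    """IT power implied by a total facility draw and a PUE (PUE is at least 1)."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return total_facility_power_mw / pue_value


if __name__ == "__main__":
    total_mw = 15.0
    it_mw = it_power_from_total(total_mw, 1.5)
    overhead_mw = total_mw - it_mw
    print(f"IT load: {it_mw:.1f} MW")              # 10.0 MW
    print(f"Overhead: {overhead_mw:.1f} MW")       # 5.0 MW
    print(f"Overhead share of total: {overhead_mw / total_mw:.0%}")  # 33%
```

A PUE of exactly 1.0 would mean every watt entering the facility reaches the IT equipment, so the closer a design gets to 1.0, the less power is spent on cooling and other overhead.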

Key Takeaways on Two-Phase, Direct-to-Chip Liquid Cooling

Below are just some of the advantages that companies can benefit from when using two-phase, direct-to-chip cooling.

Eliminates the Massive Amount of Water Required to Cool GPUs and CPUs. A 100-megawatt data center can use approximately 1.1 million gallons of water every day. Two-phase, direct-to-chip liquid cooling technology reduces data center water consumption in several ways. First, the cold plates do not use water at all. The two-phase pool boiling approach uses a heat transfer fluid to evacuate the heat from the chips via liquid-to-vapor phase change. The system can then reject that heat through liquid-to-air heat exchange, liquid-to-liquid heat exchange (using a primary water loop if available), or even a thermosyphon approach that would eliminate massive amounts of data center water use altogether.

High Thermal Design Power (TDP) – Supports TDPs of 2,800 W and above in a compact, environmentally friendly, scalable design.

100% Heat Reuse – Provides constant and high output water temperature (70 ℃) and 30-40% less energy for heat reuse applications. 

Superior Power Usage Effectiveness – Achieves as low as 1.04 PUE, delivering 10-20% better energy efficiency with dynamic cooling, smaller pumps, and no performance degradation over time.

Higher Server Densification – Up to 50% less space is used in an air-assisted liquid-cooled datacenter and up to 75% less space than immersion cooling.

Continuous Operation in Case of Dielectric Fluid Leak – Non-conductive, non-corrosive dielectric fluid ensures no damage and continuous operation in case of a leak, compared to water-based technologies, where leaks could cause significant server damage and outages.

Lowest Maintenance – The quality and amount of the dielectric liquid in HyperCool stay the same after many years of usage. Since no water is used, the system is free from corrosion and water-related threats such as mold.

Ideal for Chiplet Architectures – Unique design automatically maintains different temperatures at different locations which is key for AI servers leveraging the latest chiplet architectures.
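To put the PUE figures quoted in this article side by side, the sketch below compares overhead power at the industry-typical 1.5 with the best-case 1.04 cited above. The 100 megawatt IT load is an assumption chosen for illustration, and 1.04 is the "as low as" figure, so actual results would depend on the site and climate.

```python
# Illustrative comparison only; the 100 MW IT load is an assumed round number.

def overhead_power_mw(it_power_mw: float, pue_value: float) -> float:
    """Non-IT power (cooling, lighting, etc.) implied by an IT load and a PUE."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_mw * (pue_value - 1.0)


if __name__ == "__main__":
    assumed_it_load_mw = 100.0
    typical = overhead_power_mw(assumed_it_load_mw, 1.5)
    best_case = overhead_power_mw(assumed_it_load_mw, 1.04)
    print(f"Overhead at PUE 1.50: {typical:.1f} MW")
    print(f"Overhead at PUE 1.04: {best_case:.1f} MW")
    print(f"Difference: {typical - best_case:.1f} MW")
```

Under these assumptions the overhead falls from about 50 megawatts to about 4 megawatts for the same compute.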

The Future:  Taking Water out of Liquid Cooling

There is no question that the data centers and AI factories of the future will require some level of liquid cooling to bring the heat down. Two-phase, direct-to-chip liquid cooling uses dielectric liquid, which not only eliminates the risk of water leakage, but also saves this scarce resource for what it’s really needed for globally, which is drinking! And when compared to other liquid cooling solutions on the market, this method of cooling delivers unprecedented benefits in cost, sustainability and scalability. It is for these reasons that the ecosystem around this technology is growing so fast, and this is what will bring true AI sustainability to the masses in the future.
