Data Center Liquid Cooling: Meeting AI Demands

By Paul Milburn, Chief Product Officer, EkkoSense

With data centres scaling up rapidly to meet accelerating AI, HPC and big data analytics demands, there’s a huge requirement for data centre cooling solutions that can keep pace with ultra-high-density AI racks that regularly draw 100-125 kW or more per rack.

And with these IT racks potentially worth £8-10 million or more each in hardware and cooling equipment, data centre operations teams are rightly concerned that they are cooled effectively and continue to perform optimally under potentially challenging thermal conditions.

Traditional air cooling alone is clearly no longer enough to meet AI computing cooling requirements, and this has led to a surge in demand for data centre liquid cooling systems. Indeed, analyst firm McKinsey estimates that liquid cooling systems will account for almost half of the global data centre cooling market by 2030.

Integrating liquid cooling brings a number of clear thermal management benefits: liquids offer higher thermal conductivity and heat capacity than air, which enables the technology to absorb and then transfer heat more effectively.

This enables operations teams to ensure lower overall operating temperatures. Given this efficiency, liquid cooling can deliver reductions in overall data centre cooling energy usage, while its more compact size supports increases in power density.
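
To put the heat capacity point into numbers, here is a back-of-envelope sketch of how much coolant has to flow to carry away one rack's heat. It uses the standard sensible-heat relation (heat removed = mass flow x specific heat x temperature rise) with approximate textbook properties for plain water and air. The 10 K temperature rise and the function name are illustrative assumptions, and real direct-to-chip loops often use water-glycol mixes with somewhat different properties.

```python
# Rough comparison of the coolant flow needed to remove one rack's heat,
# using the sensible-heat relation Q = m_dot * c_p * delta_T.
# Fluid properties are approximate textbook values near room temperature.

WATER_CP_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_M3 = 997.0
AIR_CP_J_PER_KG_K = 1005.0
AIR_DENSITY_KG_PER_M3 = 1.2


def volumetric_flow_m3_per_s(heat_w, cp, density, delta_t_k):
    """Volume of coolant per second needed to absorb heat_w with a delta_t_k rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_w / (cp * delta_t_k)
    return mass_flow_kg_per_s / density


rack_heat_w = 100_000.0  # 100 kW rack, the lower end of the range quoted earlier
delta_t_k = 10.0         # assumed coolant temperature rise, for illustration only

water_flow = volumetric_flow_m3_per_s(
    rack_heat_w, WATER_CP_J_PER_KG_K, WATER_DENSITY_KG_PER_M3, delta_t_k
)
air_flow = volumetric_flow_m3_per_s(
    rack_heat_w, AIR_CP_J_PER_KG_K, AIR_DENSITY_KG_PER_M3, delta_t_k
)

print(f"Water: {water_flow * 60_000:.0f} litres per minute")
print(f"Air:   {air_flow:.1f} cubic metres per second")
print(f"Air needs roughly {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions, water needs in the region of 140-150 litres per minute, while air needs more than 8 cubic metres per second to do the same job, which is why air alone struggles at these rack densities.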

However, given that it’s not possible to run completely liquid-cooled data centres, the reality for most operators is that liquid cooling and air cooling will both still have an important role to play in the cooling mix – most likely as part of an evolving hybridised cooling approach.

We’re already starting to see a mix of data centre cooling options, ranging from traditional air cooling, through enhanced air cooling options such as in-row cooling, rear-door cooling and high-volume fan walls, to direct-to-chip and immersion liquid cooling. Not surprisingly, data centre operations teams are busy considering how their plans to accommodate higher-density AI racks will impact their cooling strategy going forward.

Key engineering questions still need answering

Selecting the optimum approach also requires a careful assessment of risk, performance and long-term costs, whether that means managing leak risks, recognising potential issues such as rising heat flux, or acknowledging the need to deliver consistent thermal performance under demanding AI workloads.

Consequently, a number of key engineering questions still need answering before deploying liquid cooling, including working out the optimal blend of air and liquid cooling required for dynamic IT loads. Here are 12 key questions that data centre operations teams need to be asking:

What is the optimal hybrid liquid/air cooling mix across each of your rooms? How do you plan to keep this in sync?
What temperatures can your new AI compute engines safely operate at?
What steps have you taken to manage liquid cooling leak risks? Will you be able to pick up on potential leaks before they start to impact performance?
How are you addressing potential issues such as rising heat flux, especially with dynamic IT loads?
How do you know if your CDUs are delivering thermally uniform performance across their target racks running AI and HPC workloads?
How do you establish the key points for the data monitoring of liquid cooling flows? How granular should liquid cooling monitoring be? What exactly are you looking to measure?
How are you monitoring the key chilled water assets that support your data sites? Do you have visibility of the holistic performance of the end-to-end system?
How do you balance chiller staging, flow rates and temperatures with your fluctuating IT load requirements?
How do you find the sweet spot where data collection works best for chillers in AI data centres?
Who actually ‘owns’ the control and, crucially, the ongoing configuration piece for your liquid cooling deployments?
How are you considering cooling pre-conditioning for the introduction of dynamic IT loads?
Are you considering feedback triggers from systems to back off loads in the event of cooling anomalies with your liquid systems?
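
As one way of picturing the last question, here is a minimal sketch of a feedback trigger that maps coolant readings to a load action. Everything in it is hypothetical: the `CoolantReading` fields, the threshold values and the action names are placeholders for illustration, not figures from any vendor or standard, and real limits would come from your equipment specifications.

```python
from dataclasses import dataclass


@dataclass
class CoolantReading:
    """One sample from a liquid cooling loop (hypothetical fields)."""
    supply_temp_c: float
    flow_l_min: float


def backoff_action(
    reading,
    warn_temp_c=40.0,      # placeholder: start throttling above this supply temperature
    trip_temp_c=45.0,      # placeholder: shed load above this supply temperature
    warn_flow_l_min=110.0, # placeholder: start throttling below this flow
    trip_flow_l_min=100.0, # placeholder: shed load below this flow
):
    """Return 'normal', 'throttle' or 'shed_load' for a single reading."""
    if reading.supply_temp_c >= trip_temp_c or reading.flow_l_min <= trip_flow_l_min:
        return "shed_load"
    if reading.supply_temp_c >= warn_temp_c or reading.flow_l_min <= warn_flow_l_min:
        return "throttle"
    return "normal"


samples = [
    CoolantReading(supply_temp_c=30.0, flow_l_min=150.0),
    CoolantReading(supply_temp_c=42.0, flow_l_min=150.0),
    CoolantReading(supply_temp_c=30.0, flow_l_min=95.0),
]
for sample in samples:
    print(sample, "->", backoff_action(sample))
```

A production trigger would also need hysteresis and time-based filtering so that a single noisy sample does not throttle a training run, which ties back to the question of who owns the control configuration.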

Getting the liquid cooling answers you need

Given the breadth of questions that still need answering, it’s entirely normal for data centre operations teams to have concerns when they’re investing significantly in expensive areas of risk such as liquid cooling.

The transition to hybrid air and liquid cooling to support AI computing needs careful design, planning, deployment and ongoing management. Placing new AI workloads demands smart capacity management, with careful consideration of space, power, and air and liquid cooling requirements. This increases the need for real-time visibility across both white and grey space.

With liquid cooling deployments, there’s always an element of air cooling, which can account for anywhere between 15% and 30% of a liquid cooling installation depending on the configuration and the CDUs installed.

So there’s a huge potential for optimisation of this balance to minimise the ratio. And because AI workload heat loads are so immense, you could actually end up using more air cooling than you ever did before the introduction of liquid cooling!
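
A quick calculation shows why. The sketch below treats the 15-30% figure as the share of rack heat that still has to be removed by air, which is an interpretive assumption, and compares it against a hypothetical 10 kW legacy air-cooled rack that is not a figure from this article.

```python
def residual_air_load_kw(rack_kw, air_fraction):
    """Heat (kW) left for air cooling when air_fraction of rack heat is not captured by liquid."""
    if not 0.0 <= air_fraction <= 1.0:
        raise ValueError("air_fraction must be between 0 and 1")
    return rack_kw * air_fraction


LEGACY_RACK_KW = 10.0  # assumed legacy air-cooled rack, for comparison only

for rack_kw in (100.0, 125.0):
    for air_fraction in (0.15, 0.30):
        air_kw = residual_air_load_kw(rack_kw, air_fraction)
        print(
            f"{rack_kw:.0f} kW rack, {air_fraction:.0%} to air: "
            f"{air_kw:.1f} kW of air cooling "
            f"({air_kw / LEGACY_RACK_KW:.1f}x the assumed legacy rack)"
        )
```

Even at the low end of the range, a 100 kW rack leaves about 15 kW for air to handle, more than the whole of the assumed legacy rack.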

Without effective instrumentation you’ll never know if you have a problem

Successful liquid cooling deployment is all about understanding potential risks and having the right level of visibility into increasingly complex cooling installations.

When you’re investing hundreds of millions in AI data centres, lack of visibility is simply too big a risk for most operators.

Unless you effectively instrument all aspects of your cooling, you’ll never know if you’re having, or are about to have, a problem. That’s why at EkkoSense we provide an AI-powered data centre optimisation platform that allows operations teams to manage the real-time performance of all their air, liquid and hybrid-cooled environments.

Capabilities such as anomaly detection across airflow and liquid cooling can pick up on any flow-rate anomalies, while an embedded hybrid cooling advisor is the first to offer in-room optimisation of both hybrid air and liquid cooling.
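
To illustrate the general idea of flow-rate anomaly detection, here is a deliberately simple sketch that flags readings which stray too far from a trailing window of recent values. It is a generic rolling z-score check, not a description of how the EkkoSense platform works, and the window size, threshold, noise floor and sample data are all made-up assumptions.

```python
from statistics import mean, stdev


def flow_anomalies(readings, window=10, threshold=3.0, min_sigma=0.5):
    """Return indices of readings more than `threshold` standard deviations
    away from the mean of the preceding `window` readings.

    min_sigma is an assumed sensor noise floor (same units as the readings)
    that stops a perfectly flat history from flagging tiny changes.
    """
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        sigma = max(stdev(history), min_sigma)
        if abs(readings[i] - mean(history)) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up CDU flow readings in litres per minute, with one sudden drop.
flow_l_min = [144.0, 145.0, 143.5, 144.5, 144.0, 145.5, 143.0,
              144.0, 144.5, 145.0, 144.0, 120.0, 144.5]

print(flow_anomalies(flow_l_min))
```

With this sample data the sudden drop to 120 litres per minute is the only reading flagged. Note that the outlier then sits in the trailing window and inflates the spread for the readings that follow, so a median-based measure would be a more robust choice in practice.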

Given the costs involved, it’s infinitely better to identify and resolve potential hybrid cooling issues before they start impacting availability.