Code red as data centres face thermal runaway during extreme heat

James Kirkwood
Head of Technical Sales

Jul 19, 2022







With the UK recording temperatures above 40°C for the first time today, UK and European data centre operators need to adjust to temperatures of 40°C and above becoming much more likely in European summers.

What can catch many operators by surprise is just how quickly thermal runaway, triggered by cooling plant failure, can transform a data centre that was running fine into a site with real problems. Where cooling systems and, critically, their resilience have not been tested in anger, plant failures or reduced output caused by high ambient temperatures can quickly lead to high IT temperatures. It’s easy to see how a site previously thought to be operating well can end up well over temperature within hours, if not minutes.
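To give a feel for the "within hours, if not minutes" timescale, here is a rough back-of-envelope sketch. It assumes only the room air absorbs the heat (ignoring the thermal mass of equipment and the building fabric, so real rooms will warm more slowly) and uses approximate textbook values for air density and specific heat. The function name and all example figures are hypothetical, not taken from any real site.

```python
# Approximate properties of air near room temperature (assumed values).
AIR_DENSITY_KG_M3 = 1.2
AIR_CP_J_PER_KG_K = 1005.0


def minutes_to_limit(room_volume_m3, it_load_kw, cooling_kw, start_c, limit_c):
    """Estimate minutes for room air to warm from start_c to limit_c.

    Returns None if the remaining cooling matches or exceeds the IT load.
    Only the air is heated in this model, so treat the result as a
    pessimistic lower bound rather than a prediction.
    """
    if limit_c <= start_c:
        return 0.0  # already at or above the limit
    net_heat_w = (it_load_kw - cooling_kw) * 1000.0
    if net_heat_w <= 0:
        return None  # cooling keeps up, no runaway in this simple model
    # Energy needed per degree of air temperature rise (J/K).
    heat_capacity_j_per_k = room_volume_m3 * AIR_DENSITY_KG_M3 * AIR_CP_J_PER_KG_K
    seconds = (limit_c - start_c) * heat_capacity_j_per_k / net_heat_w
    return seconds / 60.0


if __name__ == "__main__":
    # Hypothetical room: 500 m3 of air, 100 kW IT load, all cooling lost.
    print(minutes_to_limit(500, 100, 0, start_c=24, limit_c=35))
```

With these illustrative numbers the air-only estimate comes out at roughly one minute, which is why even partial loss of cooling output deserves an immediate response.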

As the UK’s ten hottest years on record have all occurred since 2002, it’s reasonable to assume that extreme heat is going to become a more frequent occurrence. Data centre operators need to build this into their risk planning, preparing not only for eventualities such as thermal runaway but also immediate action plans for when a similar event occurs.

Given that cooling issues still account for almost a third of unplanned data centre outages, it’s increasingly important for operations teams to have access to the data required to effectively manage their thermal performance. EkkoSense, with our EkkoSoft Critical software and Critical Things sensor technology, is ideally placed to equip organisations with a solution that can monitor, plan and optimise the critical data centre environment.

This kind of visibility is particularly important before, during and after extreme temperature periods. It gives teams precise data and real thermal insight so they can highlight unforeseen issues before they occur, track actual performance in real time when they do, and truly assess site performance after an event so the site can be improved for next time. A simple ‘we were OK this time’ is not enough. Monitoring cooling performance is clearly essential if you’re going to find out exactly how your rooms perform when the weather gets really hot.

There are some clear lessons that organisations can learn, and we’ve divided them into two key tip areas: what you can do to prepare your data centre for extreme heat events, and what immediate steps you can take if you’re facing thermal runaway.

What can you do to prepare your data centre for extreme heat events?

  • Pre-empt potential thermal failures – with real-time thermal monitoring in place you can track cooling performance and identify any poorly performing cooling systems in advance so timely improvements can be made. In our experience, granular rack and CRAC-level monitoring always finds hidden, but easy-to-fix, cooling and airflow problems that typical cooling PPMs and BMS systems fail to find or diagnose. Rack monitoring, cooling utilisation and zone-of-influence analytics are critical.

  • Make sure you’re continually optimising your data centre’s operational performance – data centre teams need to be continually optimising their site performance to make sure it is fully resilient, and that their resilient capacity is ready to go and not wasted. Ongoing monitoring ensures that teams get to learn their rooms, highlighting any poorly performing areas (hot spots) and imbalances well in advance. These small imbalances, often hidden by overcapacity, become major issues or catalysts when extreme events occur.

  • Be prepared with true testing – you’re not fully prepared unless you’re actively running comprehensive test scenarios, and collecting and analysing performance data to identify issues and develop tactics for handling them in advance. Simulation is not enough: procedures, BMS systems, alarms and so on must all be live tested in a controlled (risk-managed) manner to make sure they react as expected when an extreme event happens.

  • Become more agile and adapt – data centres are continually evolving, so it’s important to adapt your cooling, capacity and power strategies to meet changing requirements. Rebalancing rooms should be part of your daily tactics, and full performance reviews should be undertaken after periods of high ambient temperature – even when no alarms were raised or SLAs breached – to understand how your DC reacted. Any anomalies picked up can then be resolved before the next time. Any data collected here can also be applied to help future risk planning.
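As a minimal illustration of the hot-spot checks described in the tips above, the sketch below flags racks whose inlet temperature is close to, or above, a chosen limit. It is not how EkkoSoft Critical works internally; the function name, the 27°C limit, the 3°C margin and the rack readings are all assumptions for illustration, and your own limits should come from your equipment specifications and SLAs.

```python
def find_hot_spots(inlet_temps_c, limit_c=27.0, margin_c=3.0):
    """Return (rack_id, temp_c, headroom_c) for racks within margin_c of limit_c.

    inlet_temps_c maps a rack id to its latest inlet temperature in Celsius.
    Results are sorted hottest first. Negative headroom means the rack is
    already over the limit. limit_c and margin_c are illustrative defaults.
    """
    flagged = []
    for rack_id, temp_c in inlet_temps_c.items():
        headroom_c = limit_c - temp_c
        if headroom_c <= margin_c:
            flagged.append((rack_id, temp_c, headroom_c))
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged


if __name__ == "__main__":
    # Hypothetical readings from three racks.
    readings = {"R01": 22.0, "R02": 25.5, "R03": 28.0}
    for rack_id, temp_c, headroom_c in find_hot_spots(readings):
        print(f"{rack_id}: {temp_c:.1f} C, headroom {headroom_c:+.1f} C")
```

Running a check like this on ordinary days, not just during heatwaves, is one way to surface the small imbalances that overcapacity tends to hide.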


Immediate steps to take when facing thermal runaway

  • Understand exactly where the crisis hit – if your data centre is overheating you need to be able to identify precisely what’s happening and take immediate steps to address your most critical thermal areas. Only real-time granular rack and cooling monitoring can help you here. Cooling equipment such as CRACs may not be alarming but can still be performing poorly.

  • Focus engineering attention – giving your on-site engineering resource specific tasks is vital, so it’s important to have rack-level insight into what needs addressing first. The kind of real-time granular thermal data tracked by EkkoSoft Critical will provide this guidance, for example by identifying a particular cooling unit that requires urgent attention.

  • Know exactly where to apply cooling – being able to pinpoint hotspots means that any resilient or emergency cooling resource you have can be applied where it’s most needed, taking out the guesswork at a critical time.

  • Visualise your airflows – very few rooms have thermal airflows that are 100% optimised, and this matters even more during an extreme heat event. Simply using a software-only solution such as the Cooling Advisor capabilities with EkkoSoft Critical can help improve your thermal performance – a potentially critical differentiator during the hottest weather.
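The triage idea behind these steps, knowing which racks need attention first, can be sketched as a simple ranking by projected time to a temperature limit. This is a hypothetical illustration only: the function name, the 32°C limit, the linear extrapolation and the sample readings are assumptions, and a real event calls for live monitoring data rather than two-point estimates.

```python
def rank_racks_by_urgency(history, limit_c=32.0):
    """Rank racks by estimated minutes until inlet temperature reaches limit_c.

    history maps a rack id to a list of (minute, inlet_temp_c) samples,
    oldest first. Racks already at or over the limit get 0.0 minutes.
    Racks that are steady or cooling are left out of the ranking.
    Uses a straight-line extrapolation between first and last sample.
    """
    ranked = []
    for rack_id, samples in history.items():
        if not samples:
            continue
        last_minute, last_temp = samples[-1]
        if last_temp >= limit_c:
            ranked.append((rack_id, 0.0))
            continue
        if len(samples) < 2:
            continue  # not enough data to estimate a trend
        first_minute, first_temp = samples[0]
        if last_minute <= first_minute:
            continue  # timestamps unusable for a rate estimate
        rate_c_per_min = (last_temp - first_temp) / (last_minute - first_minute)
        if rate_c_per_min <= 0:
            continue  # steady or cooling, not urgent
        ranked.append((rack_id, (limit_c - last_temp) / rate_c_per_min))
    ranked.sort(key=lambda item: item[1])
    return ranked


if __name__ == "__main__":
    # Hypothetical ten-minute histories for four racks.
    history = {
        "A1": [(0, 24.0), (10, 29.0)],
        "B2": [(0, 25.0), (10, 25.0)],
        "C3": [(0, 30.0), (10, 33.0)],
        "D4": [(0, 22.0), (10, 24.0)],
    }
    for rack_id, minutes in rank_racks_by_urgency(history):
        print(f"{rack_id}: about {minutes:.0f} min to limit")
```

In this example the rack already over the limit comes first, followed by the fastest-rising rack, which gives on-site engineers an ordered list rather than a wall of readings.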


EkkoSense is the leading thermal expert for data centres worldwide, and we are ready to mobilise our tools and expertise to help our clients identify and prevent issues or, if needed, solve them when they occur. To understand more about how tools such as EkkoSoft Critical and its Cooling Advisor service can help you reduce the risks associated with potential thermal failure, contact me, James Kirkwood, at [email protected], or book a demo of EkkoSoft Critical Cooling Advisor at www.ekkosense.com/demo


EkkoSense solutions are available directly, or through the company’s expanding network of international partners across North and South America, Latin America, Europe, the Middle East and Asia Pacific.



www.ekkosense.com  
Follow us on Twitter @ekkosenseUK

Press Contact: Cheryl Billson, Comma Communications – PR for EkkoSense,
+44 (0)7791 720460
[email protected]