Code red as data centres face thermal runaway during extreme heat
August 8, 2019 5:28 am

by James Kirkwood, Head of Critical Services, EkkoSense

With the temperature hitting 38.1C in the UK at the end of July, and records also falling across the Netherlands, Belgium, Germany and France (where Paris recorded a high of 42.6C), climate experts are talking of 40C temperatures becoming the new normal for European summers.

While media attention focused on the impact of high temperatures on the rail network, and problems for both patients and staff in hospitals, these record temperatures have also brought with them a stark warning for the data centre industry.

With the extreme heat building from central France and sweeping up over South-East England, Belgium and the Netherlands, the resulting urban heat island effect prevented many cities from cooling. This proved a critical issue for operators in a number of Europe’s key data centre nodes, impacting London, Paris, Frankfurt and Amsterdam particularly.

What caught many operators by surprise was just how quickly thermal runaway can transform a data centre that was running fine into a site with real problems due to cooling plant failure. Where cooling systems and, critically, their resilience have not been tested in anger, plant failures or reduced output caused by high ambient temperatures can quickly lead to high IT temperatures. So it’s easy to see how a site previously thought to be operating well can go well over temperature within hours, if not minutes.

As the UK’s ten hottest years on record have all occurred since 2002, it’s reasonable to assume that extreme heat is going to become a more frequent occurrence. Data centre operators need to build this into their risk planning, preparing not only for eventualities such as thermal runaway but also with immediate action plans for when a similar event occurs.

Given that cooling issues still account for almost a third of unplanned data centre outages, it’s increasingly important for operations teams to have access to the data required to effectively manage their thermal performance. EkkoSense, with our EkkoSoft Critical software and Critical Things sensor technology, is ideally placed to equip organisations with a solution that can monitor, plan and optimise the critical data centre environment.

This kind of visibility is particularly important before, during and after extreme temperature periods. It gives teams precise data and real thermal insight so they can highlight unforeseen issues before they occur, track actual performance in real time when they do, and properly assess site performance after an event so the site can be improved for next time. A simple ‘we were ok this time’ is not enough. Monitoring cooling performance is clearly essential if you’re going to find out exactly how your rooms perform when the weather gets really hot.

From what we’ve seen over the last couple of weeks there are some clear lessons that organisations can learn. We’ve divided them into two key areas: what you can do to prepare your data centre for extreme heat events, and what immediate steps you can take if you’re facing thermal runaway.

What can you do to prepare your data centre for extreme heat events?

  1. Pre-empt potential thermal failures – with real-time thermal monitoring in place you can track cooling performance and identify any poor-performing cooling systems in advance so timely improvements can be made. In our experience, granular rack and CRAC level monitoring always finds hidden, but easy to fix, cooling and airflow problems that typical cooling PPMs and BMS systems fail to find or diagnose. Rack monitoring, cooling utilisation and zone of influence analytics are critical.
  2. Make sure you’re continually optimising your data centre’s operational performance – data centre teams need to be continually optimising their site performance to make sure it is fully resilient, and that their resilient capacity is ready to go and not wasted. Ongoing monitoring ensures that teams get to learn their rooms, highlighting any poor-performing areas (hot spots) and imbalances well in advance. These small imbalances, often hidden by overcapacity, become major issues or catalysts when extreme events occur.
  3. Be prepared with true testing – you’re not fully prepared unless you’re actively running comprehensive test scenarios, and collecting and analysing performance data to identify and develop tactics for handling issues in advance. Simulation is not enough: procedures, BMS systems, alarms etc. must all be live tested in a controlled (risk managed) manner to make sure they react as expected when an extreme event happens.
  4. Become more agile and adapt – data centres are continually evolving, so it’s important to adapt your cooling, capacity and power strategies to meet changing requirements. Rebalancing rooms should be part of your daily tactics, and full performance reviews should be undertaken after periods of high ambient temperature – even when no alarms were triggered or SLAs breached – to understand how your DC reacted. If any anomalies are picked up, these can then be resolved before the next time. Any data collected here can also be applied to help future risk planning.
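
As a rough illustration of the kind of check that rack-level monitoring makes possible, here is a minimal Python sketch that groups racks by their average inlet temperature. It is not how EkkoSoft Critical works internally; the function name, the sample readings, the 27C limit and the 2C warning margin are all assumptions chosen for the example, and you would substitute your own SLA values and sensor feed.

```python
from statistics import mean

# Assumed values for illustration only; use your own SLA thresholds.
INLET_LIMIT_C = 27.0  # hypothetical alert threshold for rack inlet temperature
MARGIN_C = 2.0        # hypothetical warning band below the limit


def classify_racks(readings, limit=INLET_LIMIT_C, margin=MARGIN_C):
    """Group racks by mean inlet temperature.

    readings maps a rack name to a list of inlet temperatures in Celsius.
    Returns a dict with 'over_limit', 'at_risk' and 'ok' rack lists.
    """
    report = {"over_limit": [], "at_risk": [], "ok": []}
    for rack, temps in sorted(readings.items()):
        avg = mean(temps)
        if avg > limit:
            report["over_limit"].append(rack)
        elif avg > limit - margin:
            # Still within limits, but with little headroom for a hot day.
            report["at_risk"].append(rack)
        else:
            report["ok"].append(rack)
    return report


# Made-up sample readings, not measurements from a real site.
sample = {
    "A01": [22.1, 22.4, 22.0],
    "A02": [25.6, 25.9, 26.3],
    "B07": [27.8, 28.4, 29.1],
}
report = classify_racks(sample)
print(report)
```

Racks in the 'at_risk' group are the ones most likely to be hidden by overcapacity on a normal day, so they are natural candidates for rebalancing before the next heatwave.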

Immediate steps to take when facing thermal runaway

  1. Understand exactly where the crisis has hit – if your data centre is overheating you need to be able to identify precisely what’s happening and take immediate steps to address your most critical thermal areas. Only real-time granular rack and cooling monitoring can help you here. Cooling equipment such as CRACs may not be alarming but can still be performing poorly.
  2. Focus engineering attention – giving your on-site engineering resource specific tasks is vital, so it’s important to have rack-level insight into what needs addressing first. The kind of real-time granular thermal data tracked by EkkoSoft Critical will provide this guidance, for example by identifying a particular cooling unit that requires urgent attention.
  3. Know exactly where to apply cooling – being able to pinpoint hotspots means that any resilient or emergency cooling resource you have can be applied where it’s most needed, taking out the guesswork at a critical time.
  4. Visualise your airflows – very few rooms have thermal airflows that are 100% optimised, and this becomes even more critical during an extreme heat event. Simply using a software-only solution such as the Cooling Advisor capabilities within EkkoSoft Critical can help improve your thermal performance – a potentially critical differentiator during the hottest weather.
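
To show how rack-level data could help decide what to address first, the following Python sketch ranks racks by a simple linear estimate of how many minutes remain before each one reaches a temperature limit. This is an illustrative assumption, not a description of EkkoSoft Critical: the function names, the 5-minute sampling interval, the 32C limit and the sample readings are all hypothetical, and a straight-line estimate is only a crude stand-in for real thermal behaviour.

```python
def minutes_to_limit(temps, interval_min, limit_c):
    """Estimate minutes until limit_c from the first and last readings.

    Returns 0.0 if the latest reading is already at or over the limit,
    and None if the rack is not warming or has too few readings.
    """
    if len(temps) < 2:
        return None
    current = temps[-1]
    if current >= limit_c:
        return 0.0
    elapsed = interval_min * (len(temps) - 1)
    rate = (current - temps[0]) / elapsed  # degrees C per minute
    if rate <= 0:
        return None
    return (limit_c - current) / rate


def prioritise(readings, interval_min=5.0, limit_c=32.0):
    """Return (rack, minutes) pairs, most urgent first.

    interval_min and limit_c are assumed defaults for this example.
    """
    ranked = []
    for rack, temps in readings.items():
        eta = minutes_to_limit(temps, interval_min, limit_c)
        if eta is not None:
            ranked.append((eta, rack))
    ranked.sort()
    return [(rack, round(eta, 1)) for eta, rack in ranked]


# Made-up inlet temperatures sampled every 5 minutes.
sample = {
    "C03": [26.0, 27.0, 28.0],
    "C04": [24.0, 24.0, 23.5],
    "D11": [30.0, 31.5, 33.0],
    "D12": [28.0, 29.0, 31.0],
}
for rack, eta in prioritise(sample):
    print(f"{rack}: about {eta} minutes to limit")
```

Racks that are stable or cooling are left out of the list, so engineers see only the locations where emergency cooling or airflow changes would make a difference.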

EkkoSense is the leading thermal expert for data centres worldwide, and we are ready to mobilise our tools and expertise to help our clients identify and prevent issues or, if needed, solve them when they occur. To understand more about how tools such as EkkoSoft Critical and its Cooling Advisor service can help you reduce the risks associated with potential thermal failure, contact me, James Kirkwood, at james.kirkwood@ekkosense.com or book a demo of EkkoSoft Critical Cooling Advisor at www.ekkosense.com/demo


This post was written by Cheryl Billson






© 2019 Copyright EkkoSense.