Why data centres need much better early warning systems to reduce thermal risk this summer
July 15, 2020 8:29 am
By James Kirkwood, Head of Critical Services, EkkoSense
Already 2020 is looking like it’s going to be one of the top five hottest years on record, with temperatures easily matching last year’s summer heatwave highs.
While that’s great for people wanting to share time outdoors with family and friends after a tough few months, the fact that the UK was hotter than the Bahamas last month should be ringing alarm bells for data centres. Peak temperatures inevitably bring thermal challenges, and as the UK’s ten hottest years on record have all occurred since 2002, data centre cooling strategies clearly need to be ready for whatever summer brings.
Given that cooling issues still account for almost a third of unplanned data centre outages, mitigating the impact of escalating temperatures needs to be part of any data centre’s risk planning approach. Unfortunately, most organisations still seem unaware of the thermal risks that can quickly place their data centre operations in danger. Or perhaps they think they’re on top of the issue because they already have a solid BMS in place? With thermal issues now ranking as the second largest cause of data centre loss of service, it’s critical for organisations to reduce this risk by optimising their thermal performance.
Identifying the early warning signs
Even the most experienced data centre operations teams can be surprised at just how little time it takes for thermal issues to flare up. Cooling plant failure can easily escalate into a thermal runaway situation in no time, transforming a room that’s operating normally into a site that’s got real problems.
At EkkoSense we’ve found that one of the key reasons for this is that incumbent solutions such as a BMS are not very effective at spotting thermal challenges in time. Thermal issues such as cooling, and airflow problems typically don’t trigger BMS alerts early enough as there is no hard SLA breach or fault. And when they do, it’s often too late to prevent SLA breaches. The result is that thermal issues can quickly escalate, creating localised data centre hot spots that impact overall performance before the operations team has a chance to resolve the problem.
Don’t wait for alerts – taking a more proactive approach
That’s why pre-empting potential thermal failure is so important. By applying the latest AI and machine learning technologies, affordable software solutions now exist that can work in parallel with BMS systems to identify and manage away thermal risk from any data centre.
With this kind of real-time thermal monitoring in place you can track cooling outputs and identify any poorly performing cooling systems in advance so timely improvements can be made. Granular rack and CRAC level monitoring is essential here for finding the hidden – but easy to fix – cooling and airflow problems that typical cooling PPMs and BMS systems fail to find or diagnose.
Thanks to ongoing development of our EkkoSoft Critical monitoring system, we’re now able to complete remote thermal risk prediction analysis of critical sites. In one recent example, our software and analytics capability was used to remotely identify abnormal thermal behaviour, remotely diagnose the issue and recommend how its effect could be immediately mitigated. All this before any BMS system has even picked up the issue.
This video provides an illustration of how a predictive analytics-based approach can equip data centres with the early warning capability that’s needed to prevent outages. In this example a data centre with a normal and steady cooling duty profile very quickly became thermally unstable thanks to a failing CRAC unit. The timeline was as follows:
- EkkoSense software analytics initially identified abnormal CRAC behaviour, drawing on performance data from its EkkoAir cooling duty sensor in the CRACs.
- Software identifies the individual ‘poor performing’ CRAC
- Software provided early warning of a localised hot spot due to the CRAC issue
- Software analytics also showed that other local CRACs, while still functional, were unable to remove the hot spot
- Software recommended switching off the failed CRAC to remove recirculating hot air. Once actioned, the hot spot issue was immediately resolved
- CRAC issue investigated and resolved, normal thermal operation was restored and confirmed by the software
At no point during this sequence did the incumbent BMS generate an alert as there was no specific component failure or alert threshold that was breached. This example shows how the software’s early risk detection analytics was able to identify and diagnose poor-performing cooling plant before its ultimate failure, so that any potential thermal risk was removed, and repairs could be planned in good time. It also illustrates how the lack of an alert generation by the BMS means that without additional predictive analytics the data centre team would have remained unaware of the issue or where to look. By looking at the data centre room as a whole, the EkkoSoft Critical analysis software was able to pick up on subtle changes – such as a set point change, a stuck valve or a grille that’s been moved – that could potentially lead to a broader thermal issue.
Early warning before boiling over
Unlike traditional BMS approaches that only generate alerts when a system has failed or a threshold breached, EkkoSense’s blend of highly granular sensing and EkkoSoft Critical real-time algorithms can highlight potential equipment failures before they have a chance to impact data centre service availability.
It’s only by removing 100% of thermal risk from their data centre operations and providing a stable platform for subsequent cooling optimisation projects, that data centre managers can truly achieve ‘thermal piece of mind’. And, thanks to EkkoSoft Critical’s real-time insights – as opposed to the theoretical modelling provided by other approaches – it’s possible for data centre teams to unlock other initiatives such as a transition to more condition-based monitoring. This could deliver significant savings by moving away from more complex, ineffective and expensive maintenance schedules.
To understand more about how you can minimize your exposure to summer temperature peaks – while also providing remote thermal monitoring – contact me James Kirkwood at [email protected] or book a demo here www.ekkosense.com/demo
Categorised in: Uncategorized
This post was written by James Kirkwood