AI & HPC Data Center Challenges in Financial Services
"Untying the Gordian Knot of Data Center Challenges Facing Financial Services Companies in the Era of AI/HPC"
By Steve Lewis, VP of Sales for the Americas at EkkoSense
Data center leaders at financial services companies are facing an increasingly complex set of operational challenges as their organizations make major investments in AI/HPC and as the number of applications they support skyrockets. Because these operational challenges are interconnected, data center teams often feel like they are facing an impossible conundrum in which attempting to solve one challenge only makes another challenge worse. It brings to mind the Gordian Knot of ancient Greek legend, where pulling a string in one part of the tangle makes the knot even tighter and harder to undo.
A perfect example is the tension between uptime expectations and two simultaneous trends: rapid expansion of AI/HPC infrastructure and a massive proliferation of IT applications the data center team must support. In every financial services company, data center teams face intense internal and regulatory pressure to ensure impeccable uptime and availability. But at the same time, those teams are being asked to support dozens or hundreds of new applications as well as rapidly expanding AI/HPC computing. The volume of new applications and the higher operational complexity of AI/HPC both create escalating risks of downtime that conflict directly with the expectation of impeccable uptime. These opposing pressures put data center teams in a very difficult position.
The same is true for energy efficiency and AI/HPC. Data center teams at financial services companies are being asked to show demonstrable progress toward ambitious, publicly-reported objectives related to energy efficiency and sustainability. However, there is also intense pressure to expand high-density, energy-intensive AI/HPC environments as well as ever-increasing enterprise/cloud infrastructure. Once again, these two objectives are in direct conflict with one another.
And all of those dynamics above are happening against the same Gordian-like backdrop that data center teams have faced for years and years in the financial services industry: How to support more and more infrastructure, with aging equipment and management tools (e.g. DCIM and BMS tools), and without enough people on the team to do everything that needs to be done.
This knot of conflicting pressures existed before the AI era, but the rapid adoption of AI/HPC in the financial services industry has made the knot feel tighter and tighter, leaving no room for data center leaders to solve one challenge without opening up significant risk elsewhere.
How was the original Gordian Knot solved? Alexander the Great rejected the premise that the puzzle could only be worked strand by strand, tightening it further with every attempt. He cut through it with a sword, eliminating the knot in a single stroke. Data center teams may not have a sword to solve these conflicting challenges, but there is a way to successfully untie this increasingly complex knot without worsening any of the individual challenges.
To do this, data center teams need to start by increasing visibility into their data center infrastructure. Traditional DCIM, BMS, and other data center management tools leave even the most vigilant data center team without truly comprehensive, granular, real-time visibility into thermal and power systems. Enhanced visibility would enable organizations to mitigate emerging downtime threats and dramatically increase energy efficiency, and to do so at scale as AI/HPC and enterprise/cloud infrastructure rapidly expands.
There are thousands of hidden risks and inefficiencies in any organization’s data centers, caused by the gaps inside and in between traditional DCIM, BMS and other systems. These gaps hide emerging risks for downtime and also hide opportunities for increasing energy efficiency.
Temperature monitoring is one example of these gaps. Traditional data center monitoring tools measure only the temperature of the cooling unit. They provide no data about a far more important measurement: rack inlet temperature, a much better predictor of emerging thermal issues that could escalate into downtime incidents. Monitoring these inlet temps also reveals opportunities to fine-tune cooling in real time to optimize energy consumption and drive efficiency.
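To make the distinction concrete, the logic of inlet-temperature monitoring can be sketched in a few lines. This is a hypothetical illustration, not EkkoSense's implementation: the rack names and readings are invented, and the thresholds come from the widely used ASHRAE-recommended envelope of roughly 18–27 °C for IT equipment inlet air.

```python
# Hypothetical sketch: flag racks whose inlet temperature falls outside
# the ASHRAE-recommended envelope (~18-27 degrees C). Readings and rack
# IDs are illustrative, not from any real monitoring API.

ASHRAE_RECOMMENDED_MIN_C = 18.0
ASHRAE_RECOMMENDED_MAX_C = 27.0

def flag_inlet_risks(inlet_temps_c: dict[str, float]) -> list[str]:
    """Return rack IDs whose inlet temperature is outside the
    recommended envelope -- candidates for proactive investigation
    before a thermal issue escalates into a downtime incident."""
    return [
        rack
        for rack, temp in sorted(inlet_temps_c.items())
        if not (ASHRAE_RECOMMENDED_MIN_C <= temp <= ASHRAE_RECOMMENDED_MAX_C)
    ]

# Illustrative readings: one rack running hot, one near the edge.
readings = {"rack-a01": 22.5, "rack-a02": 28.1, "rack-b01": 26.9}
print(flag_inlet_risks(readings))  # rack-a02 exceeds the envelope
```

The point of the sketch is simply that the alert is driven by what the IT equipment actually ingests at the rack face, not by the supply temperature at the cooling unit, which can mask hot spots several rows away.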
Comprehensive visibility is the first step because it gives data center teams the insights they need. The second step is to harness AI-driven continuous optimization to turn that data into results. Taking these steps can eliminate up to 40% of downtime risks before they escalate into incidents, even with rapidly expanding AI/HPC infrastructure. These steps can also increase energy efficiency by up to 30%, even as organizations roll out more high-density AI/HPC infrastructure.
With truly comprehensive visibility combined with machine learning and AI-driven continuous optimization, data center teams can begin solving challenges that were previously in direct conflict with one another. An organization can now expand its high-density AI/HPC computing systems while also driving significant gains in energy efficiency that support ESG goals. It can support the higher-risk operating environments of AI/HPC data centers while proactively mitigating risks across every rack in every data hall. And data center teams can ease the pressure on their limited staff and resources with AI-driven optimization that acts as a force multiplier.
These conflicting pressures and challenges on the data center teams of financial services companies are complex, but this strategy of increasing visibility and leveraging automation gives data center leaders a chance to turn a previously impossible knot into one that can be successfully untied.
About the author:
Steve Lewis is the VP of Sales for the Americas at EkkoSense, a leader in advanced sensing technology, SaaS DCIM-class visualization and monitoring software, and AI-powered analytics solutions for critical facilities such as data centers. Lewis has more than twenty years of experience in critical infrastructure and data center innovation, and is known for his expertise in advancing intelligent, energy-efficient solutions that strengthen operational resilience across global markets.