According to a survey from Statista, 80% of enterprises had deployed a hybrid cloud in their organisation as of March 2022. As the adoption of hybrid cloud environments and technologies continues to grow, it is becoming increasingly challenging for IT operations teams to keep pace with the complexity and sheer volume of data that digital systems tend to generate, says Joseph George, VP of product management for DSOM at BMC.
Many customers now look to ingest millions of metrics per second from multiple tools. However, without the power of artificial intelligence for IT operations (AIOps), there is no practical way to scale operations teams so they can effectively handle that volume of data. The stakes continue to rise, too. According to ITIC’s 2022 Global Server Hardware Security survey, 91% of SMEs and large enterprises agreed that a single hour of downtime can cost more than $300,000. Overall, 44% of mid-sized and large enterprise respondents reported that a single hour of downtime can potentially cost their business more than one million dollars.
It should therefore come as no surprise that AIOps is a hot topic and is expected to remain so for the foreseeable future. Organisations that implement AIOps correctly will free up the time of skilled employees so they can work on innovative projects. In the meantime, artificial intelligence (AI) and machine learning (ML)-powered software can handle the increasing volume of metrics, events, and logs, all while ensuring the business continues to operate smoothly. However, as with other enterprise software efforts, rolling out AIOps without a solid plan in place is not a recommended path to success. To succeed, organisations can look to a number of AIOps use cases.
Anomaly detection
The first use case to address is anomaly detection. AI-powered advanced anomaly detection finds outliers in data, which can help to dynamically baseline services: the observed behaviour of the system automatically sets the thresholds for the generation of events. Advanced anomaly detection typically involves multivariate algorithms and can adjust automatically to the system behaviours that it learns over time. With the resulting insights, IT operations teams can monitor the systems more intelligently, with alerting thresholds automatically adapted to the normal behavioural characteristics of systems.
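To make the idea of a dynamic baseline concrete, the following is a minimal sketch rather than how any particular AIOps product works. It assumes a single metric and a trailing window, whereas the advanced detection described above is typically multivariate. The function name `dynamic_baseline_alerts` and the parameters `window` and `k` are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def dynamic_baseline_alerts(values, window=10, k=3.0):
    """Return the indices of readings that fall outside a moving baseline.

    The threshold is not fixed in advance: it is recomputed from the
    trailing `window` readings as mean +/- k standard deviations.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            centre = mean(history)
            spread = max(stdev(history), 1e-9)  # avoid a zero-width band
            if abs(value - centre) > k * spread:
                anomalies.append(index)
        # Every reading, anomalous or not, feeds the next baseline.
        history.append(value)
    return anomalies


# Illustrative CPU utilisation readings (made-up data).
cpu = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 95, 50, 51]
print(dynamic_baseline_alerts(cpu))  # [10]
```

Because the band is derived from recent behaviour, a service that normally runs hot would not alert at the same absolute value as one that normally idles.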
Event correlation
In addition, AIOps can reduce the noise of myriad events across an environment by breaking down data silos and ingesting data in the form of logs, events, traces, and metrics. Advanced AIOps technologies can correlate events along multiple dimensions of time, text, and topology. This helps to eliminate noise such as duplicate and dependent events, and to aggregate multiple underlying events into higher-level situations.
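A simplified sketch of correlation along two of those dimensions, time and topology, might look like the following. It is an illustration under stated assumptions, not a production algorithm: the `Event` type, the `neighbours` map of directly linked hosts, and the 60-second window are all hypothetical, and text is only used to drop exact duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    host: str
    message: str


def are_related(host_a, host_b, neighbours):
    """True if the hosts are the same or directly linked in the topology map."""
    return (
        host_a == host_b
        or host_b in neighbours.get(host_a, set())
        or host_a in neighbours.get(host_b, set())
    )


def correlate(events, neighbours, window=60.0):
    """Group events into situations by closeness in time and topology."""
    situations = []
    for event in sorted(events, key=lambda e: e.timestamp):
        for situation in situations:
            recent = event.timestamp - situation["last_seen"] <= window
            linked = any(
                are_related(event.host, host, neighbours)
                for host in situation["hosts"]
            )
            if recent and linked:
                situation["last_seen"] = event.timestamp
                situation["hosts"].add(event.host)
                key = (event.host, event.message)
                if key not in situation["seen"]:  # drop exact duplicates
                    situation["seen"].add(key)
                    situation["events"].append(event)
                break
        else:
            # No existing situation matched, so this event starts a new one.
            situations.append({
                "last_seen": event.timestamp,
                "hosts": {event.host},
                "seen": {(event.host, event.message)},
                "events": [event],
            })
    return situations


# Made-up topology and events for illustration.
neighbours = {"db1": {"app1"}, "app1": {"web1"}}
events = [
    Event(0, "db1", "disk latency high"),
    Event(5, "db1", "disk latency high"),  # duplicate
    Event(10, "app1", "query timeout"),
    Event(20, "web1", "HTTP 500"),
    Event(30, "mail1", "queue full"),      # unrelated host
]
situations = correlate(events, neighbours)
print(len(situations), [len(s["events"]) for s in situations])  # 2 [3, 1]
```

Here five raw events collapse into two situations, one of which spans the database, application, and web tiers.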
Root cause isolation
Understanding the root cause of an issue requires an accurate view of the relationships between different elements in the organisation’s environment. By leveraging topology-enhanced, knowledge-graph-based AI/ML, root causes can be identified more accurately, which, in turn, reduces the time it takes to detect the source of a problem. By applying this type of advanced analysis to operational metrics across infrastructure and applications, AIOps can zero in on the true problem. This saves IT teams time and energy which could be better spent elsewhere, while also reducing operational costs to the business.
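As a rough illustration of why topology matters here, the sketch below walks a dependency map and keeps only those alerting components whose own upstream dependencies are healthy. This is a deliberately simple heuristic and not the knowledge-graph-based AI/ML described above. The `depends_on` map and the function names are hypothetical.

```python
def transitive_dependencies(depends_on, node):
    """Return everything `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


def root_cause_candidates(depends_on, alerting):
    """Alerting nodes with no alerting dependency are likely causes, not symptoms."""
    alerting = set(alerting)
    candidates = []
    for node in alerting:
        upstream = transitive_dependencies(depends_on, node) - {node}
        if not upstream & alerting:
            candidates.append(node)
    return sorted(candidates)


# Made-up service topology: each key depends on the listed components.
depends_on = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["storage"],
}
print(root_cause_candidates(depends_on, {"web", "app", "db"}))  # ['db']
```

The web and application tiers are both alerting, but each depends on the alerting database, so only the database is reported as a candidate.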
Intelligent automation and remediation
While reducing event noise and finding the root cause of issues are incredibly valuable, it ultimately comes down to taking remediation action to fix the problem. Modern AIOps solutions can support automated remediation actions in response to issues, ideally integrating with a broad range of automation platforms and tools.
As operations teams become more comfortable with automation based on the historical success of remediation, they are able to define policies so that those actions are triggered automatically based on the root cause detected. Over time, AIOps can learn how successful automation has been in different situations and proactively recommend automation opportunities.
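One way such a policy could be expressed is sketched below. It assumes a hypothetical mapping from root causes to runbook actions and a simple record of past outcomes. The thresholds `min_attempts` and `min_success_rate`, and all runbook names, are illustrative choices rather than recommendations.

```python
from dataclasses import dataclass


@dataclass
class RemediationHistory:
    successes: int = 0
    failures: int = 0

    @property
    def attempts(self):
        return self.successes + self.failures

    @property
    def success_rate(self):
        return self.successes / self.attempts if self.attempts else 0.0


def decide(root_cause, runbooks, history, min_attempts=5, min_success_rate=0.9):
    """Return (decision, action) for a detected root cause.

    An action runs automatically only once it has a proven track record.
    Otherwise it is recommended to a human, or the issue is escalated.
    """
    action = runbooks.get(root_cause)
    if action is None:
        return ("escalate", None)
    record = history.get(action, RemediationHistory())
    if record.attempts >= min_attempts and record.success_rate >= min_success_rate:
        return ("auto_run", action)
    return ("recommend", action)


# Made-up runbooks and outcome history for illustration.
runbooks = {"disk_full": "purge_tmp", "memory_leak": "restart_service"}
history = {
    "purge_tmp": RemediationHistory(successes=19, failures=1),
    "restart_service": RemediationHistory(successes=2, failures=1),
}
print(decide("disk_full", runbooks, history))      # ('auto_run', 'purge_tmp')
print(decide("memory_leak", runbooks, history))    # ('recommend', 'restart_service')
print(decide("cert_expired", runbooks, history))   # ('escalate', None)
```

Keeping the recommend step separate from the automatic step mirrors the gradual build-up of trust described above.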
Predictive insights
AIOps can take IT operations to the next level by looking ahead, predicting potential issues, and taking corrective measures before those issues occur. This includes identifying resource saturation and capacity limit situations by projecting the organic growth of a system and learning from past behaviours. This enables IT operations teams to identify actions such as provisioning additional capacity or resources before any issues occur.
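A very basic version of projecting organic growth is a straight-line fit through recent usage readings, extended until it crosses the capacity limit. The sketch below assumes evenly spaced daily readings and linear growth, which real systems often violate. The function name `days_until_saturation` is hypothetical.

```python
def days_until_saturation(usage, capacity):
    """Estimate days from the last reading until usage reaches capacity.

    Fits an ordinary least-squares line through evenly spaced daily
    readings. Returns None if there is too little data or no upward trend.
    """
    n = len(usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Made-up disk usage in percent, one reading per day.
disk_usage = [60, 62, 64, 66, 68]
print(days_until_saturation(disk_usage, capacity=100))  # 16.0
```

With roughly two percentage points of growth per day, the projection leaves about 16 days in which to provision additional capacity.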
AIOps systems can also look at historical patterns in data and identify where a system failure or degradation in performance is expected to happen. This type of real-time predictive alerting saves the team from having to react to a problem after the fact and instead enables the business to prevent service outages from happening at all. Businesses cannot deliver digital experiences on the front end without also putting the right tools in place to digitally transform the back end.
Essentially, AIOps will allow IT operations to support increasingly digital businesses by intelligently analysing large volumes of data, learning system behaviours, and automatically recommending actions.
However, despite the benefits, many organisations have a long way to go when it comes to implementing AIOps. In fact, according to another survey from Statista, only 29% of respondents were exploring AIOps in 2022, while 26% believe AIOps is not strategic for their organisation. This suggests that many are at the early stages of embracing digital transformation or becoming an autonomous digital enterprise, but this will change in the years ahead. The sheer complexity of enterprise IT operations, and the volume of data produced each day, mean that AIOps will soon become an operational necessity. The enterprises that embrace it early will benefit from an immediate advantage and a head start on the market.
AIOps has the potential to work both proactively, to prevent system failures, and reactively, through rapid root cause isolation for issues that could not be prevented. Focusing on these use cases will enable organisations to embrace new application architectures and increasingly complex, hybrid ecosystems. It also ensures IT operations can keep pace with the needs of the business and evolving customer demands.
The author is Joseph George, VP of product management for DSOM at BMC.