The future of AIOps with Agentic AI as part of Operations


What already seems like many years ago, I wrote a blog post about AIOps, covering what it could do and how one could implement it: What is AIOps and why should I care? – msandbu.org

With all the advancements in GenAI, it’s a great time to revisit the topic, especially as AIOps (how we integrate AI into IT Operations) is set to undergo significant changes.

Historically, AIOps was about:

A datalake that collected metrics and events from different monitoring systems and the IT infrastructure
Using ML algorithms to find anomalies and identify root causes, and also using this data to build a predictive engine that would identify issues before they happened, based upon data stored in the datalake.
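
To make the anomaly-detection idea concrete, here is a minimal sketch that flags a metric sample when it deviates strongly from the trailing window of earlier samples (a rolling z-score). This is an illustration only, not how any particular vendor does it; the function name, window size, threshold and sample data are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation samples (percent), one per minute.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 44, 42]
print(find_anomalies(cpu_percent))  # → [7]
```

A real AIOps engine would also have to deal with seasonality, many correlated metrics and missing data, which is part of why this was hard to build as a product.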

While this sounds like a utopia, a few vendors in the market were able to successfully build a product that could do this. The main issue was that they either needed to support a wide range of different tools and products, or they needed a product that could integrate with the entire infrastructure.

Most of these vendors were placed into two categories: those that provide it as part of their own monitoring systems, such as

Dynatrace, Splunk or Datadog – For instance, Splunk ITSI has an engine that can do noise reduction of incidents using ML, and can also do RCA based upon metrics and events collected from different sources.

Or those that provide a product which acts like a hub where monitoring tools plug into their system, such as

BigPanda, OpsRamp and Moogsoft

However, we have also seen an ITSM-based approach from vendors such as ServiceNow, which act like they try to solve everything.

The issue today is that most services an IT department tries to manage are a combination of many platforms and cloud services. With IaaS, PaaS, SaaS, self-hosted and private cloud, the application landscape is pretty complex. This means that finding the «needle in the haystack» when something is not working is becoming more and more difficult.

This also means that a traditional AIOps approach with the datalake and ML algorithms is suddenly even more difficult to implement. However, I now feel that Agentic AI is going to become a bigger part of IT Operations.

Where can Agentic AI help out in IT operations? (Also visualised below)

Service Desk (first-line support) – A virtual assistant combined with RAG that can answer questions and even create detailed incidents based upon descriptions or pictures. We can also have assistants that follow up on old incidents, checking whether old/stale incidents are still pending. These assistants can also be voice activated, allowing end-users to call directly and have a conversation with them regarding issues.
Knowledge Article Assistant – Virtual assistants that can create knowledge articles or improve existing ones based upon incidents. We can also have an assistant that creates knowledge articles based upon changes done, using video recordings as a source.
Endpoint Assistants – A virtual assistant running locally on a user’s device that, in combination with RAG and MCP, can troubleshoot directly and even create incidents without the user even knowing. Using RAG, it can also give the user help if there are persistent issues. These agents can also notify the user of ongoing issues posted by the IT department. These will rely on locally installed LLMs such as Phi-4 or DeepSeek, which are used by, for instance, Copilot for Windows.

IT Operations Assistants – Custom assistants working within a platform such as Kubernetes, cloud or a virtualization environment. Just as Microsoft has built an Azure SRE Agent that can troubleshoot PaaS services in Azure, we will see many of these agents that are customized and running within different environments. They are always running, checking for misconfigurations or errors, and using the tools available to them to fix or diagnose issues. These agents can also create incidents.
Service Desk Assistants – These are focused only on looking at incidents and trying to do RCA, for example through web search or traditional RAG to find solutions to existing issues. They can also look at multiple issues and determine if there is a bigger problem that impacts the IT services.
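
As a rough illustration of the last point, looking across multiple incidents to spot a bigger problem, the sketch below groups incidents by affected service and flags a service when several incidents are opened close together in time. The `Incident` fields, the 30-minute window and the threshold of three are all assumptions made for the example; a real assistant would pull incidents from the ITSM tool and likely use an LLM to judge whether they are actually related.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    id: str
    service: str
    opened: datetime


def find_problem_candidates(incidents, window=timedelta(minutes=30), min_count=3):
    """Return {service: [incident ids]} for services that have at least
    `min_count` incidents opened within `window` of each other.
    Only the first such cluster per service is reported."""
    by_service = defaultdict(list)
    for inc in incidents:
        by_service[inc.service].append(inc)

    candidates = {}
    for service, items in by_service.items():
        items.sort(key=lambda inc: inc.opened)
        start = 0
        for end in range(len(items)):
            # Slide the left edge forward until the span fits in the window.
            while items[end].opened - items[start].opened > window:
                start += 1
            if end - start + 1 >= min_count:
                candidates[service] = [inc.id for inc in items[start:end + 1]]
                break
    return candidates


# Hypothetical incidents for illustration.
t0 = datetime(2025, 1, 1, 9, 0)
incidents = [
    Incident("INC001", "email", t0),
    Incident("INC002", "vpn", t0 + timedelta(minutes=5)),
    Incident("INC003", "email", t0 + timedelta(minutes=10)),
    Incident("INC004", "email", t0 + timedelta(minutes=20)),
    Incident("INC005", "vpn", t0 + timedelta(hours=3)),
]
print(find_problem_candidates(incidents))
# → {'email': ['INC001', 'INC003', 'INC004']}
```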

While some of these examples are possible and already running in production, others are still a bit away. As an example, Microsoft is now introducing MCP (Model Context Protocol) directly into Windows, allowing it to call third-party services as a native part of the operating system. This has not been released yet, but when it arrives, hopefully together with an agent framework, it will make it easier to build an endpoint agent.
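
For context, MCP messages are JSON-RPC 2.0, and an agent invokes a tool with a `tools/call` request. The sketch below only builds such a request as a Python dictionary; the tool name `get_disk_usage` and its arguments are hypothetical, and how Windows will expose its MCP servers is not covered here since it has not been released.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool that an endpoint agent might call on the local device.
request = build_tool_call(1, "get_disk_usage", {"drive": "C:"})
print(json.dumps(request, indent=2))
```

In practice an agent framework or MCP client library would handle the transport and the initial handshake, so an endpoint agent would rarely build these messages by hand.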

These agents will come in many different shapes and forms. Some will be available out of the box from different vendors, such as ITSM agents from ServiceNow (called Now Assist), and some from the cloud providers, such as Azure SRE Agent and even one for Kubernetes. Others will be custom built using one of the agent frameworks (LangGraph, n8n, Copilot Studio and so on).

Agentic AI can save time on resolving issues, but it struggles with performing RCA on problems requiring debugging of extensive logs and metrics, as LLMs currently lack the capacity to handle large datasets effectively.
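
One common way to work around this limit is to condense logs before an agent ever sees them. The sketch below is one possible approach, not a complete solution: it masks variable parts of each line (numbers, hex values, IPv4 addresses) so that similar lines collapse into one template, then returns the most frequent templates with their counts. The regular expression and the sample log lines are assumptions for illustration.

```python
import re
from collections import Counter

# Matches IPv4 addresses, hex literals and plain integers.
_VARIABLE = re.compile(r"\b(?:\d+\.\d+\.\d+\.\d+|0x[0-9a-fA-F]+|\d+)\b")


def condense_logs(lines, top_n=5):
    """Collapse log lines into templates by masking variable tokens,
    and return the `top_n` most frequent (template, count) pairs."""
    templates = Counter(
        _VARIABLE.sub("<*>", line.strip()) for line in lines if line.strip()
    )
    return templates.most_common(top_n)


# Hypothetical log lines for illustration.
logs = [
    "Connection timeout to 10.0.0.12 after 30 s",
    "Connection timeout to 10.0.0.47 after 30 s",
    "User 1041 logged in",
    "Connection timeout to 10.0.0.12 after 31 s",
    "Disk usage at 91 percent",
]
for template, count in condense_logs(logs):
    print(count, template)
```

A summary like this fits far more easily into a prompt than the raw logs, at the cost of losing the exact values, so the agent still needs a way to fetch the original lines when it wants detail.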

It’s important to note that Agentic AI or integrating AI into operations won’t replace skilled engineers. Instead, it will help them stay focused and save time on “noise,” considering the IT landscape is quite complex and made up of many different layers and technologies.

Hopefully, this will ensure that issues are resolved more quickly and assist the IT team by reducing the time spent on routine tasks.
