Implementing AIOps with Redapt and AWS

Josh Dougherty (00:11):

Hello and welcome to our panel discussion on AIOps. I'm your moderator, Josh Dougherty. Joining me today are four experts on this subject from both Redapt and AWS. First, we have Rizwan Patel from Redapt, a Field Chief Technology Officer. We also have Jerry Meisner, Practice Lead of Modern Data Center and DevOps. From the AWS side, we have Lucy Hartung, Senior Business Development Manager and Bobby Hallahan, who's a Senior Solution Architect for Observability. Well, let's dive in and start the conversation. Rizwan, I'd like to send the first question over to you. Can you talk about what AIOps is and why organizations should consider adopting it?

Rizwan Patel (01:11):

Sure. Thanks, Josh. AIOps, which stands for Artificial Intelligence for IT Operations, is an emerging technology, and the definition, like the space, is fluid. I like to think of AIOps as a technology that combines artificial intelligence (AI) and machine learning (ML) to automate (that's the key, automation) data correlation, enable root cause analysis, and deliver insights, both predictive and prescriptive, across the entire enterprise ecosystem. This is done by accessing and combining cross-application data and using it to create a baseline, a customized, self-learning model, and then using that model to deliver insights such as real-time anomaly detection and proactive, filtered, actionable alerts, minimizing noise and helping IT and the business prioritize where they need to spend time. It not only reduces cost but also helps increase revenue, mitigate risk, and elevate brand reputation by reducing mean time to detect (MTTD), mean time to investigate (MTTI), and mean time to resolve (MTTR), and by helping achieve SLO metrics.
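
The idea of learning a baseline and flagging deviations from it can be illustrated with a very small sketch. This is not how any particular AIOps product works; it assumes a trailing-window mean and standard deviation as the "baseline", and the function name, window size, threshold, and sample latencies are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(find_anomalies(latency_ms))
```

A production system would typically account for seasonality and correlate many signals at once; the sketch only shows the baseline-then-deviation pattern for a single metric.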

Jerry Meisner (02:39):

I think Rizwan nailed it.

Josh Dougherty (02:41):

Jerry, I have a question regarding the fact that we don't live in a perfect world. Outages and equipment failures happen and AIOps can help reduce things like MTTR, as Rizwan talked about. For IT, how does that work?

Jerry Meisner (03:07):

So, AIOps works by ingesting data and events from various sources, which could include system or hardware failures. Today, you usually have an end user who would react to that system failure. Or, they'll get a page and they'll begin to make API calls against the cloud, or maybe something like VMware on premises, to replace that failed system. In many cases, these alerts wait to be triaged or prioritized before work can even begin on them. AIOps can triage and prioritize those almost immediately, including rich details about the issue pulled from multiple data sources, which immediately makes the issues more actionable for the monitoring team. This saves some time.

As similar alerts are generated over time, automation can be crafted to perform common operations and leave the incident response team available to handle uncommon scenarios, resulting in higher resiliency and much faster MTTR for both the common and the uncommon scenarios. AIOps can also be used to predict impending resource exhaustion, which can lead to proactive operations rather than reactive operations. So, if you see that a system is approaching a disk limit, or it's constantly at the high end of CPU usage, or something along those lines is failing in conjunction with that system, we can react to that early instead of waiting for it to fail and then responding.
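
As a rough illustration of predicting resource exhaustion, the sketch below fits a straight line to recent disk-usage samples and extrapolates to the point where the disk would be full. It assumes usage grows roughly linearly, which is a simplification; the function name and the sample readings are hypothetical.

```python
def hours_until_full(samples, capacity_pct=100.0):
    """Estimate hours from the last sample until usage reaches capacity.

    `samples` is a list of (hour, used_percent) pairs. A least-squares
    line is fitted to them; returns None if usage is not growing or
    there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = [t for t, _ in samples]
    ys = [u for _, u in samples]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples taken at the same time
    slope = sum((x - x_mean) * (y - y_mean) for x, y in samples) / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = y_mean - slope * x_mean
    t_full = (capacity_pct - intercept) / slope
    return max(0.0, t_full - xs[-1])


# Hypothetical hourly readings: disk usage climbing 2 points per hour.
readings = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
print(hours_until_full(readings))
```

An alert raised when the estimate drops below some lead time (say, a working day) is one way to turn this into the proactive response described above.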

Josh Dougherty (04:25):

Excellent. Anyone have anything to add about MTTR for IT?

Bobby Hallahan (04:30):

Being able to predict resource exhaustion, or any other type of anomalous behavior before it occurs is, for most customers, one of the more appealing aspects of implementing AIOps in their organization. Definitely a plus one on that.

Josh Dougherty (04:56):

Absolutely. What other benefits does AIOps provide for organizations, Rizwan?

Rizwan Patel (05:03):

Benefits provided by AIOps go beyond the traditional measures of cost reduction or cost optimization. They cover a much wider scope, extending to business metrics related to increasing revenue. AIOps can and does help in proactively identifying and alleviating issues, preventing lost sales and customer churn, and, specifically when applied to cybersecurity use cases, it helps reduce risk and elevate brand reputation. Additionally, rather than treating the technology as a standalone, enterprises that adopt and integrate AIOps with their ITSM and CMDB systems have seen dramatic benefits from automated knowledge management capture and reuse, and accelerated ticket resolution from more efficient routing, tracking, and handling. They’ve typically also seen a reduction in the number of monitoring tools, such as application performance monitoring (APM), infrastructure monitoring, network performance monitoring, real user monitoring (RUM), et cetera, that they needed prior to integrating AIOps in their ecosystem.

Finally, I thought this metric would be compelling to our audience. Companies surveyed by Enterprise Management Associates ranked AIOps as the most successful IT analytics investment, with 81% indicating that the value they got from AIOps exceeds its cost, including 42% who said that it does so dramatically.

Josh Dougherty (06:42):

So, if people are recognizing it as a valuable tool for their enterprises, what are some of the misconceptions that keep organizations from implementing AIOps, or from investing in it? Jerry, do you have thoughts on that?

Jerry Meisner (07:01):

Hiring the folks who know how to use some of the ML and AI tools, hiring automation engineers to help with the automation endpoints that come after that, those types of things would have an impact.

Josh Dougherty (07:16):

We know that tons of people are seeing the value of AIOps and implementing it in their enterprises. What steps should someone take to begin adopting it, Jerry?

Jerry Meisner (07:29):

The first step to adopting AIOps would be to identify a business case that AIOps could help with. So, determining how much impact something like downtime has on a particular product within the organization, especially in terms of dollars and reputation, really helps set up a cost-benefit analysis. This can be used to determine how much budget to set aside for the implementation and run rate of an AIOps solution. The next thing would be to identify a potential AIOps solution and the effort to implement that solution, including its maintenance and long-term manageability. There are some great resources out there that already compare and contrast a lot of the AIOps solutions. We're going to discuss one of those in a later part of this panel.
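
The cost-benefit framing can be reduced to a few lines of arithmetic. The sketch below assumes downtime cost scales linearly with incident duration, so that a given fractional reduction in MTTR saves the same fraction of downtime cost. Every figure is hypothetical and stands in for numbers an organization would supply itself.

```python
def annual_downtime_cost(incidents_per_year, avg_hours_per_incident,
                         cost_per_hour):
    """Yearly cost of downtime for one product, in dollars."""
    return incidents_per_year * avg_hours_per_incident * cost_per_hour


def net_annual_benefit(downtime_cost, mttr_reduction, solution_cost):
    """Savings from shorter incidents minus the yearly cost of the solution.

    `mttr_reduction` is a fraction, e.g. 0.25 for a 25% shorter MTTR.
    """
    return downtime_cost * mttr_reduction - solution_cost


# Hypothetical inputs: 12 incidents a year, 2 hours each, $50,000 per hour.
cost = annual_downtime_cost(12, 2, 50_000)
# Hypothetical: 25% MTTR reduction, $180,000 per year to run the solution.
benefit = net_annual_benefit(cost, 0.25, 180_000)
print(cost, benefit)
```

Reputation impact is harder to put a dollar figure on and is left out of the sketch; it would strengthen the case rather than weaken it.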

Once you've identified the platform you want to work with, you'll start with a small use case, which might include a couple of data ingest sources that already exist within the organization, and begin elevating more meaningful tickets and alerts for the folks who are already monitoring those systems. It does help to already be leveraging infrastructure as code, or other configuration management and automation tools, because the promised benefits are ultimately realized as you begin to tie those tools and processes into the tickets, issues, and alerts generated through the AIOps ingestion platform.

Josh Dougherty (08:45):

Let’s switch the conversation over to the cloud and specifically AWS. We're lucky to have some folks from AWS here. Lucy, how does AWS enable AIOps for enterprises?

Lucy Hartung (09:05):

Thanks for having us here. One of the great things about the cloud and AWS is that you can enable AIOps as a whole, or do it in steps. Whether you're building everything on AWS or migrating existing infrastructure onto AWS, we offer several AIOps-enabled services that can assist with observability for your operations teams. From simply enabling CloudWatch anomaly detection to make your existing alarms smarter, to enabling and configuring X-Ray Insights, Amazon DevOps Guru, GuardDuty, and more, AWS provides you with several options. A lot of them are fully managed services that use a pre-trained machine learning model to recognize and surface anomalous behavior. This means that you don't really need to have existing ML expertise in house.
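
To make the "smarter alarm" idea concrete, here is a minimal sketch that builds the request for a CloudWatch alarm based on an anomaly detection band instead of a fixed threshold. The helper name, alarm name, instance ID, and band width are hypothetical; the parameter layout follows the CloudWatch `PutMetricAlarm` API as commonly used through boto3 and should be checked against current documentation before use.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions,
                         band_width=2, period=300, evaluation_periods=2):
    """Build keyword arguments for an anomaly-detection alarm.

    The alarm fires when the metric rises above the upper edge of the
    band that CloudWatch learns for it.
    """
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "ThresholdMetricId": "band",  # compare against the band, not a number
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical alarm on EC2 CPU utilization; the instance ID is a placeholder.
params = anomaly_alarm_params(
    "cpu-anomaly-example",
    "AWS/EC2",
    "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
)
# With AWS credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["Metrics"][1]["Expression"])
```

Keeping the request as plain data makes it easy to review or template before anything is created in an account.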

Josh Dougherty (10:13):

Bobby, I'm interested if you could add to that. What are some other reasons that AWS is a good place for AIOps?

Bobby Hallahan (10:26):

I think this boils down to three things. The first, which Lucy mentioned, is customer choice. When it comes to leveraging AIOps, or getting started, there are several options on AWS. We can start with SageMaker, where you're building your own ML models. You can look to include other managed AI offerings, such as anomaly detection for CloudWatch alarms, or leveraging DevOps Guru, or X-Ray Insights, where it's all right there. As long as you have those output signals, our tooling is able to predict and identify anomalous behavior on your behalf. At that point it just becomes, “How do I consume these findings? How do I respond to this anomalous behavior that's being reported for me?”

The next part is innovation. Our service teams are always looking to improve our services and add new features to our AI offerings. For reference, take a look at the What's New page for SageMaker. We're constantly releasing new services and features around SageMaker, as well as other offerings in CloudWatch, such as DevOps Guru, or X-Ray Insights. DevOps Guru specifically is a relatively new service, and there's a lot coming down the pipeline for us. We're excited for what's coming in the future. Last, and most importantly, I think it comes down to interoperability. What makes AWS a great place for AIOps are many of the same things that make it a good place to run any cloud workload: the ability to leverage AWS services in conjunction with our other AI offerings.

Let's take an example where we have insights coming from DevOps Guru. Those insights are sent to Amazon EventBridge, and from EventBridge you can route them to a variety of different targets. You could do a fan out through SNS and consume them in multiple downstream applications. You could send them to centralized event buses in centralized accounts, then process them from there. You can invoke Lambda in response to those events, and then do something about it and determine, “Is this something I need to take action on?” You can even go to the extent of having that Lambda function, or some Systems Manager automation, try to automatically remediate that type of event given the prescriptive guidance coming from DevOps Guru.

So not only are we surfacing that, "Hey, here's a problem and we think it's something you need to know about," but we're also showing that there's prescriptive guidance. "This is what we think the problem is and these are some things you could do about it." That very last step is simply asking how we can automate in response. A lot of the points made earlier, specifically around investing in people that can build and maintain these automations, are incredibly valuable and important.
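
The flow described here (DevOps Guru insight, then EventBridge, then Lambda) could be sketched as a small handler that decides whether to remediate or just notify. The event shape below, including the `aws.devops-guru` source and the `insightSeverity` and `insightId` detail fields, is an assumption about what DevOps Guru emits and should be verified against the service documentation; the routing rule itself is hypothetical.

```python
import json

# Hypothetical ranking used to decide how to react to an insight.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


def lambda_handler(event, context):
    """Decide what to do with a DevOps Guru insight delivered by EventBridge."""
    detail = event.get("detail", {})
    severity = str(detail.get("insightSeverity", "low")).lower()
    is_urgent = SEVERITY_RANK.get(severity, 0) >= SEVERITY_RANK["high"]
    decision = {
        "insight_id": detail.get("insightId"),
        "severity": severity,
        # A real function might start a Systems Manager automation here,
        # or publish to SNS for the "notify" path.
        "action": "remediate" if is_urgent else "notify",
    }
    print(json.dumps(decision))
    return decision


# Assumed example event; the field values are placeholders.
sample_event = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "detail": {"insightId": "example-insight-id", "insightSeverity": "high"},
}
result = lambda_handler(sample_event, None)
```

Starting with "notify" for everything and promoting specific, well-understood insights to automatic remediation over time is one cautious way to adopt this pattern.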

Josh Dougherty (13:16):

Say an organization wants to get started with the process right now. What first steps should they take if they're looking at using AWS for their AIOps practice?

Bobby Hallahan (13:43):

One of the first steps is making sure you're enabling whatever services you want to enable that will produce these output signals for you. Let's say you are using something like anomaly detection, or you have your own models in SageMaker evaluating these output signals from your applications. Or, maybe we're using DevOps Guru or X-Ray Insights. All of these are emitting output signals. That's the very first step. A key thing when it comes down to AIOps, and generally speaking with IT ops, is iteration. Without those events, you can't necessarily iterate on how to get better at dealing with those. Having them enabled is the first step, but, from there, understanding where to iterate and what to iterate on is incredibly important.

To help guide that, we take a look at some of those current business level objectives and think about where and how we can improve on those goals. We find low hanging fruit, easy applications you can experiment with, something you can do a proof of concept with quickly, fail fast, and iterate quickly. Those are paramount when it comes down to AIOps. This is something that we spoke to earlier. When it comes down to focusing it on business metrics, we can always look at MTTD, MTTI, MTTR. Having a more reliable application, having an application or workload that is more performant for your end customers, that's more reliable for your end customers, those are great things. Nobody's going to argue against that.

But then being able to take that and elevate those business-level metrics, that's where a lot of customers see value in AIOps. So, understand what you are implementing, understand what you want to measure as your yardsticks for success, and, from there, iterate and expand over time. It echoes a lot of the earlier points, but is a bit more specific to AWS.
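
Measuring the yardsticks mentioned above can start very simply. The sketch below computes MTTD and MTTR from incident records, assuming both are measured from the moment the incident began (definitions vary between organizations). The record fields and timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


# Hypothetical incident log with start, detection, and resolution times.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 10),
        "resolved": datetime(2024, 1, 1, 11, 0),
    },
    {
        "started": datetime(2024, 1, 2, 9, 0),
        "detected": datetime(2024, 1, 2, 9, 20),
        "resolved": datetime(2024, 1, 2, 9, 50),
    },
]

mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "resolved")
print(f"MTTD: {mttd} min, MTTR: {mttr} min")
```

Tracking these numbers before and after each iteration gives a direct way to see whether a change actually moved the metric.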

Josh Dougherty (15:56):

Lucy, can you share some use cases you've seen customers leverage for AIOps?

Lucy Hartung (16:08):

Yeah, absolutely. Some of the most straightforward use cases where I have seen customers start leveraging AIOps are around improving operational productivity by reducing false alarms. Many of the services Bobby mentioned earlier surface critical issues and reduce the false alarms the operations team has had to sort through in terms of, “What is real, what is not?” So, that's the first one. Second is reducing the time spent investigating issues. We have seen operations teams respond to an issue quickly because these AIOps services provide additional guidance and insights. Now they're able to get to the root cause faster based on guided information.

Third is dynamically learning about metrics and the environment. In today's customer environment, nothing is ever static. Everything is dynamic. New things are introduced into the environment, and applications get updated. These AIOps services can help operations teams keep their alarms and metrics up to date by automatically detecting new things as they come into the environment. That really reduces the time spent manually setting something up whenever there's something new. Lastly, there's a quite interesting one that we often see, where developers have leveraged some of these AIOps services to get more visibility into operations and into how their applications perform after they've been developed. Developers now have more knowledge about how the applications they build are performing once they reach production. That feeds information back to them once the application is released.

Here I want to share a great story. I worked with a customer recently who enabled one of the AIOps services, Amazon DevOps Guru. The service automatically ingests CloudWatch metrics and detects anomalous behavior in them. The customer was monitoring for the usual errors and throttles, but they were not looking at something called provisioned concurrency spillover invocations. In the simplest terms, this means how many of your function invocations are going over the provisioned concurrency you set up to be able to scale easily. DevOps Guru caught this anomalous behavior and was able to help this customer prevent latency and performance problems down the line. The operations folks were able to leverage the insight the AIOps service was providing to prevent a real problem from happening later on.

That's really the power of AI. We know that there are some areas where people are looking for errors to detect, but there is a whole gray area where people might not be aware that an issue has occurred. The AIOps services can really help catch these problems before they occur.

Josh Dougherty (20:06):

Excellent. That's helpful, practical information about use cases and how people can benefit from AIOps. We're about to wrap up here, but, before we go, I want to return to Rizwan. Suppose someone's been listening to this and has gotten excited about the potential of AIOps for their organization. They're looking to make that pitch to decision makers at their organization. What elevator pitch would you recommend they use to build momentum?

Rizwan Patel (20:40):

AIOps is more relevant than ever before. This is driven by the digital transformation wave that is sweeping through virtually every industry vertical, by always-on consumer-facing applications, and by the elevated experience consumers expect, all of which necessitate a shift from reactive to proactive response to issues. Market research company BCC Research estimates that the global market for AIOps will triple from almost 3 billion to 9.4 billion by 2026, at a compound annual growth rate of 26%. Common misconceptions around AIOps being resource intensive, taking a long time to show value, and lacking tools and technologies are not nearly as applicable today. We tend to agree with Gartner's proclamation that there is no future in IT operations that does not include AIOps. At Redapt, we are excited about helping our customers and partners reap the benefits of AIOps integrated in their DevOps journey.

Josh Dougherty (21:48):

Awesome. Well, that about wraps it up for us. Everyone, thanks again for coming on the panel. Thanks to everyone watching our discussion, as well. If you want to learn more about AIOps, you can schedule some time with one of our Redapt experts. Visit redapt.com/contact to get in touch. Thanks, everyone.

Bobby Hallahan (21:48):

Thank you.

Lucy Hartung (22:22):

Thank you.

Jerry Meisner (22:22):

Thanks.
