The Redapt ML Accelerator

David (00:12):

Thanks for joining the Redapt webinar. If you didn't know, Redapt is a technology solutions provider focused on helping our customers navigate the ever-changing and complex landscape of enterprise technology. We believe that highly successful organizations are powered by highly successful technology, and today we're going to be presenting our ML Accelerator, which is how we're helping organizations successfully adopt ML.

Presenting are the creators of the ML Accelerator, Paul Welch and Bryan Gilcrease. Bryan is our Senior Solutions Architect, and Paul is the Senior VP of product engineering. The ML Accelerator is their brainchild, and they're very experienced not only with the hardware that is required to power these workloads, but also with the ins and outs of the technology itself and how to help our customers get started. I'll let them introduce themselves and give a little bit more about their experience, and then Paul will drive.

Paul Welch (01:38):

Thanks, David. My name is Paul Welch. I focus on product engineering, and for Redapt that means combining pieces from a wide variety of our partners, both hardware and software, as well as the engineering expertise and services that we do, to come up with higher value solution offerings for our customers. Bryan.

Bryan Gilcrease (02:04):

Thanks, Paul. Bryan Gilcrease. I'm a Senior Solution Architect here at Redapt, and I've been working with customers to build out big data, analytics, and machine learning solutions. And this is the culmination of all our experience and conversations about what customers are going through and what we see in the industry.

Paul Welch (02:40):

So what we're going to talk about today, as David mentioned, is our ML Accelerator. We're going to cover an introduction to why enterprises are looking to adopt ML, as well as some of the challenges involved, really focusing on addressing those with our Accelerator program to help organizations do it quicker and with less risk and more certainty than they would without our program. Bryan, do you want to add anything to the agenda about the Dell partnership?

Bryan Gilcrease (03:23):

This is a co-webinar with Dell EMC as a sponsor, and we built this out with Dell EMC hardware as a reference architecture to fit into the portfolio. This ties in nicely with some of the reference architectures that they've done around this, and we've picked flexible platforms that make this able to work for our different customer needs.

Paul Welch (04:03):

So why are enterprises looking to adopt ML? I'm sure everyone's heard about AI and ML at this point. There's really been a perfect storm of developments over the past five to 10 years that's enabled it. ML, or machine learning, can use these developments to solve problems that were not feasible to even consider running on a computer in the past. One of those developments is advancements in the ML algorithms themselves, the math and the software algorithms, things like deep neural networks and transformers that can be used to optimize business decisions: for example, when to order inventory items or how to prioritize your marketing budget.

In addition, many of these algorithms (they're called models when they're built) could not have been trained before today's high performance infrastructure, like the Dell EMC servers and, in some cases, acceleration with specialized hardware like GPUs. On top of that, there's the massive explosion of data that's available today from sources like the internet and social networking, IoT devices at the edge collecting data, and so forth. These models can even be used to drive automated decisions.

I had a real-life example of this the other day myself. I had a customer-support issue with an order from a large e-commerce site that I won't name: my package arrived without the product in the box. I had a hard time actually finding a real person to talk to, so I tried their chatbot, and within 10 to 15 minutes I had the problem resolved with a replacement item being shipped, without any human intervention. And that was all driven by machine learning on the back end.

These chatbots were great early on, when many companies began automating their customer service process, giving customers a better experience and reducing both the human intervention that's required and the cost. But ML is not just for chatbots and not just for the retail industry or e-commerce or even high tech. It's being adopted for a wide variety of use cases across just about every industry, from high tech companies to education, manufacturing, retail, and oil and gas.

Just a few examples might be customers that are using ML techniques to do better fraud detection for credit card transactions or insurance claims, or customers in the financial and investment industry coming up with trading strategy recommendations or portfolio allocations. And in healthcare, especially in these COVID days, very exciting developments are happening with diagnosis and treatment recommendations. ML is even being used to develop new medicines and drugs to treat people.

So with all that potential, why isn't everybody already doing it? Well, the simple answer is, it's hard, and we're going to go through a few of the pitfalls. There's a lot of complexity and challenges, and a lot of this is brand new. If you were going to start from zero, it could take you a very long time to avoid all of these pitfalls and get to the end solution.

As I mentioned previously, finding the first problem to work on and building a business case is getting easier all the time. There are a lot of examples. All you need to do is search on the internet for some examples, or you might even just need to look at what your competitors are doing for ideas. So that's the easy part. The challenges have more to do with, first of all, getting started. There are hundreds of tools, hundreds of frameworks and libraries, and a wide variety of orchestration engines. You need to know what hardware you need and how much, and integrating all those pieces so that they work together is definitely not trivial. Bryan and I can attest to that from our own experience. And for many customers going through this process on their own, it could take a year or more to actually get it all working.

So to avoid that, what many companies do is buy high end, very expensive workstations or maybe a dedicated server for the data scientist. This is not the optimal solution. It's a quicker way to get data scientists into experimentation, but it's more expensive because each box is dedicated to one person, not shared. It's also usually not a managed solution by IT. So those boxes miss out on things like patch management and systems management. The worst problem about that approach is when you need to go to production. Let's say the data scientists have come up with this revolutionary new ML model that's going to make a huge increase in revenue, but they need to run it in production. The architecture of that high-end workstation under their desk is almost certainly going to be different from what you need in the datacenter to scale and run reliably.

So a lot of those pitfalls have to do with the learning curve, and there's the fact that there's not a 100% pre-integrated product on the market today that you can just buy to do it. You need to find all of those individual components, that specialized hardware, and make them all work together. And that's one set of challenges. Then there's also an organizational divide or a disconnect in many organizations between the data scientists who are experimenting and building these models, which is really similar to a software development process, and what the IT organization does to manage the datacenter and software products that are being promoted to production by the software development teams. Very different processes, very different architectures as well.

In fact, these all contribute to what many studies have shown, which is that more than 80% of ML models never make it out of experimentation and never make it to production. So that's a huge risk to a company making that investment to experiment and try all of these different models and techniques, for most of it never to be realized. Another set of challenges is shown in this diagram, which has been adapted from a Google white paper about the technical debt of machine learning. The key message here is that the small box in the middle called ML code is really what I've been talking about with the data scientist experimentation and development process to build a model. The reality is there's a lot more that needs to be in place to make the whole process work as intended.

When many companies start out, they think, we just need to hire a team of data scientists, and then magic will happen. But as you can see, you need things like data and data feeds. You need to collect the data, clean the data, and have a pipeline to feed the data into your process. You need configuration management tools, and you need modeling tools. And very importantly, you need tools and processes to be able to deploy those trained models into production and scale them. And I guess one thing that's not obvious in the picture is that building a machine learning model, this end-to-end process, is very analogous to developing software. There's a huge advantage to adopting, in your machine learning development process, the best practices that have been developed and agreed upon over many, many years of developing software, and also to adopting processes from SRE and DevOps, for example, in how you operate the production machine learning environment.

So those peripheral boxes are all things that Redapt is very good at and has been doing for many, many years, even though a lot of the techniques and algorithms in the ML code box are newer. We identified a lot of these opportunities and challenges, and we have seen huge demand from customers driven by their own ROI opportunity in implementing ML. We looked at a lot of the reasons why they're not already doing it and many of the challenges they were facing, and we came up with this ML Accelerator program to address them. We've spent a significant amount of time coming up with a reference architecture for the hardware and software, all of those hundreds of different pieces, so that they work together, are integrated, and can be delivered in a ready-to-use rack from our facility to your datacenter.

We combine that with our engineering expertise in advanced analytics, as well as in things like Platform Engineering and DevOps. We feel this is the best way to get you to production quickly and to avoid a lot of the pitfalls. It's architected in a way that follows how many or most of the very large scale research labs are building out their own AI datacenters, and so we know how to scale. It's built on a foundation of cloud native tools, like Kubernetes and containers, that we have scaled out to very large sizes many times for customers.

Bryan Gilcrease (16:16):

There really is a lot of engineering and work that goes into delivering a machine learning model into production. Working through this, we wanted to come up with something that addresses those challenges for different organizations at different steps in their journey to deploying scalable machine learning in production.

We also wanted something that started off in a small package, with a minimum footprint, for these organizations to get started. They could start developing some of these development skills that Paul talked about for machine learning. And then they could take that and grow and scale that out as the projects grow, or as they expand into more organizations or become a hub for machine-learning, however large they want to grow.

So putting that together, we did that with Dell EMC hardware. We wanted enterprise grade hardware, something that has a proven track record and something that IT organizations are used to managing. The Dell EMC hardware, with the iDRAC controller and server management features, is something that's been around for a long time and makes support and monitoring possible by the IT organization. So we took that and we built up, like I said, a minimum footprint to run our software solution on top of that. We also wanted to make sure that we could support running training for large models or deep learning models with GPUs. So we have an infrastructure portion, and then we've got compute nodes that scale out and support GPUs to make this possible.

For a few of the key components from the software stack, we'll be looking at Kubernetes and Kubeflow to support Jupyter notebooks, pipelines, and things like that. And we know that a lot of these are new features, so we want to put all of this together and create a services flow that mimics the infrastructure. We have a group of data scientists who are working with customers to build out these types of solutions, and we want to transfer that and to be able to do that on-premises in a built-out Dell EMC reference architecture.

Like I mentioned earlier, we base this off of Kubernetes, and we wanted to go with a Kubernetes that is easy to use, support, and deploy. So we built this out with Rancher, which makes managing different Kubernetes clusters very easy. It's an HA configuration, highly available, so even when we start with the smallest piece, it's resilient and can tolerate outages, which, if we've found anything in IT, it's that they will happen. We want to make sure that we've got support for traditional machine learning and CPU-based training, as well as acceleration for some of the larger models, like what I was talking about, using GPUs. This is unique, because a lot of times a company or a vendor will focus on one or the other. We want to make sure that we have a flexible platform that ties in your existing workloads, as well as accelerating larger workloads with GPUs.
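
To make the mixed CPU and GPU scheduling described above more concrete, here is a minimal sketch (not taken from the webinar) of how one job definition might serve both cases on Kubernetes. It builds a Pod manifest as a plain Python dictionary and prints it as JSON. The function name `training_pod`, the pod names, the image `example.com/trainer:latest`, and the CPU and memory figures are all hypothetical. The GPU request uses the `nvidia.com/gpu` extended resource, which assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def training_pod(name, image, gpus=0):
    """Build a Kubernetes Pod manifest (as a plain dict) for a training job.

    With gpus == 0 the manifest describes a CPU-only job. With gpus > 0 the
    container also asks for GPUs through the `nvidia.com/gpu` extended
    resource (assumes the NVIDIA device plugin runs on the cluster).
    """
    # Hypothetical baseline CPU and memory requests for illustration only.
    resources = {"requests": {"cpu": "2", "memory": "4Gi"}}
    if gpus > 0:
        # Extended resources such as GPUs are declared under `limits`.
        resources["limits"] = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": "trainer", "image": image, "resources": resources}
            ],
        },
    }


# The same helper covers a traditional CPU job and a GPU-accelerated one.
cpu_job = training_pod("train-cpu", "example.com/trainer:latest")
gpu_job = training_pod("train-gpu", "example.com/trainer:latest", gpus=1)
print(json.dumps(gpu_job, indent=2))
```

Because Kubernetes accepts manifests in JSON as well as YAML, output like this could in principle be saved to a file and applied to a cluster. The point of the sketch is only that the CPU and GPU variants differ by a single resource entry, so both kinds of workload can share one platform.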

We went with Kubeflow as a workflow manager. This is a project that came out of Google and is being adopted across the industry very quickly. And this is going to run in containers on top of Rancher. One of the things that Paul talked about earlier on was that putting together all these different pieces can be very difficult and can take a long time. And Paul and I, we went through this. We've had to struggle finding versions that worked with each other, making sure that there are no incompatibilities and that there's a repeatable way to deploy all of this, so that whenever you're sitting down and you're trying to use it, you're not working through infrastructure issues. You're focused on creating models and delivering business value.

One of the easy buttons in Kubeflow is that it integrates with Jupyter notebooks for self-service. This allows data scientists or analysts to spin up notebooks and start working through their development life cycle to see results interactively. And this runs on the cluster in a container. So, once they're done with that, they can destroy the instance, and the resources are freed and go back to the traditional workloads. Kubeflow also integrates a Spark operator. You can use that Spark operator, or you can just run Spark natively on Kubernetes. This allows you to do your data processing and everything separately from your machine learning, but all in the same solution.

Part of the problem that we're trying to solve is to make the development environment for data scientists more supportable by IT organizations. To do that, we included Prometheus and Grafana, and it's integrated with Rancher so you can monitor all of your cluster metrics and everything like that. It really makes things simple for IT organizations. Looking through this, we could have picked any hardware vendor to deploy this with, but we have a long history at Redapt working with Dell EMC, and that made them a great fit for the work we're doing here. We've worked hand-in-hand with their high value workflow team on many projects in the past.

We're a strategic OEM partner. We're able to work with you if you have software that you're deploying to customers, and we can take that and we can brand that, and we can deploy OEM solutions together with Dell OEM. And I think that some of these different pieces are what make this solution special and give it a little bit more value than just picking any hardware vendor to support the hardware. In the machine learning space, I think they're especially valuable. As a company, Redapt has worked with a lot of cutting edge technology companies, web tech companies, and companies that have been doing ML and large scale-out computing.

We take that knowledge, and we take that understanding of what it takes to support a solution that can be scaled and reliably supported and managed as it grows. So taking that and pulling that down, and incorporating it into a bite size piece with this ML Accelerator, was one of our primary goals. And Redapt is placed very well in the fact that we have that experience building out that scalable hardware and infrastructure, as well as services on top of that to support building out the workloads. That's where this comes together. It's not just, here's your hardware with a software stack on top of it, go learn this. We can sit down and work with you. We can come up with areas in your business that might benefit from machine learning. We can take those areas, we can build POCs around those, and we can start creating models. We can do that either from scratch, or we can take some model that's been created by a company, like for natural language processing, and adapt that to your specific use case. And we can see that all the way through to production.

David (29:00):

A couple of questions sent in through the Q&A. One was, I'll just paraphrase it here, they're interested in doing some ML, but before they engage, are there some things that they should be thinking about in regards to data?

Bryan Gilcrease (29:21):

That's often one of the first places to start when you're creating machine learning initiatives: looking at what data you have available, or what data you need, and then taking that and starting to do simple analysis on it, finding maybe outliers or bad data, and starting the process that way with simple experimentation. I think any initiative you have will start with data.
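
As an illustration of the kind of simple first-pass analysis described above, here is a minimal sketch (not from the webinar) that flags missing values and outliers in a single numeric column using only the Python standard library. The interquartile-range rule with a factor of 1.5 is one common convention chosen here as an assumption, and the function name `find_bad_and_outliers` and the `order_totals` sample data are hypothetical.

```python
import statistics


def find_bad_and_outliers(values, k=1.5):
    """Return (positions of missing values, list of outlier values).

    Missing entries are represented as None. Outliers are values that fall
    more than k interquartile ranges below the first quartile or above the
    third quartile (an assumed, commonly used rule of thumb).
    """
    missing = [i for i, v in enumerate(values) if v is None]
    clean = [v for v in values if v is not None]

    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    outliers = [v for v in clean if v < low or v > high]
    return missing, outliers


# Hypothetical sample: order totals with one missing entry and one
# suspiciously large value.
order_totals = [20.0, 22.5, 19.0, None, 21.0, 23.5, 18.5, 950.0, 20.5]
missing, outliers = find_bad_and_outliers(order_totals)
print("missing at positions:", missing)
print("outliers:", outliers)
```

A check like this does not decide whether a flagged value is an error or a genuine but rare event. It only narrows down which records are worth a closer look before any modeling starts.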

David (30:04):

In fact, we have an entire practice dedicated to that. There's one more question. Getting started, what does an initial discovery meeting entail? And then, what information should be collected, or is important to know, for that meeting?

Paul Welch (30:31):

That's a really good question. I'll give my two cents and then, Bryan, you can chip in if you have anything to add. But I think as far as initial discovery, I think having some understanding of what's going on in your industry, what your competitors are doing, and what's possible in terms of what new types of problems ML solves, are good educational types of homework to do before you get started. There is a lot of content out there to read. Having said that, our Accelerator program includes professional services, an engineering-driven jump start to help you get started. So it's not absolutely required that you know everything about what you're doing before you jump in, because we're there to help guide you.

What's helpful to know going into this? Well, I think some of the information that's useful is knowing what your IT operations processes and organization look like. That will help us to understand how to translate this architecture for your team, so that they can be comfortable taking it over and can fill any gaps in terms of operations and management that they need to fill.

Bryan Gilcrease (32:36):

In this day and age, it's very rare that we talk with customers that don't have some machine learning initiative already, even if it's in a specific business unit or just on a team here and there. So, if you could think about that and talk to those groups that may or may not be doing that, it would be useful to come to the table and say, here's what we're trying to do, and these are the pitfalls that we've already found. That can be a great place to start. Like I said, I think most companies have some of these teams doing this already.

David (33:16):

One more question trickled in here. The question is: I've heard data scientists are pricey, right, and just from my own experience, I've heard $400,000 and $500,000 a year in compensation. I don't know if that's true or not. I think for some exceptional ones that would be true. Are there machine learning models or technologies that are applicable to smaller size companies, so that maybe we don't need to make the investment in a data scientist but we can use some models? Or is there another way to do that?

Paul Welch (34:11):

I have a couple thoughts on that, and Bryan probably does as well. So first of all, in terms of comp, it always depends. Data scientist is a pretty broad title and sometimes means different things to different people — anything from people really more focused on building pipelines at a software engineering level, all the way to someone with a PhD in math, who's doing pure research. So it's hard to tell as far as the cost. But the one thing that's really encouraging is that there are so many prebuilt, or I'll say, halfway built models available from these research groups at large companies. Google, Facebook, Microsoft, Amazon, and probably a hundred others, all have research teams with a building full of those math PhDs, who are building models completely from scratch as a research R&D effort.

Many of them are open-sourcing that research and making it available so that other companies can take advantage of it. And then, NVIDIA... I should have mentioned NVIDIA as well, which has taken some of those open source models and prepackaged them into NVIDIA optimized containers that we can use on our platform. Take a model like BERT, which was originally developed by Google for natural language processing and is now what a lot of companies use as the underlying foundation for chatbots, for understanding the text that's being put in and how to respond to that. What you can do is use that as a starting point and incrementally add to it, to customize it for your use case. That's a much easier problem to solve than building it all from scratch.

David (36:37):

Well, thank you, guys. Thanks, Bryan. Thanks, Paul, for your time. And I really appreciate it. And if any of you have additional questions, just reach out to us at Redapt and we'll connect you with Bryan and Paul, so you can dive deep into machine learning.

Paul Welch (37:02):

Thanks, David.

David (37:03):

Thank you, guys.
