Speaker 1: 00:13 All right. Well, let's get started. Today Redapt is presenting on Building a Modern Analytics Platform with Google Cloud. As we go through this presentation, please feel free to send in questions via the chat feature and we'll stop periodically to answer those.
Let me introduce our presenter. It's Christof. He's had about 30 years of experience in data and analytics and he has all the GCP specializations, with obviously a lot of experience prior to GCP. And, he works on quite a few of our most critical accounts, helping them with their maturity in terms of how they handle data and prepare it for analytics. And, he's really good at going through, finding insights for our customers, and producing a platform that can generate value for years. So, I'll turn it over to you, Christof.
Christof von Ra...: 01:23 Thank you. So we're going to start with one of the questions that you might have. Why are you here on this webinar? Why would you be watching this webinar? We’ll start with, maybe, that you're here to learn what makes a modern data platform, trying to understand how it will help you gain new insights from your data, and how it is different from something you already have in place. Or, maybe you're here to learn how to build that modern data platform, find out the solutions that Google Cloud offers in order to implement that, and what skills are needed in your organization to make that happen.
Or, maybe you're here to learn about migrating your enterprise to that modern data platform and the things that you need to consider for a successful migration. And, what are best practices to keep in mind and how do you choose a partner to help with that implementation? So, we're going to dig into that.
Our agenda today is to have an introduction, talk about what this modern data platform is, look at some examples, look at blueprints, and then identify some of the key areas of benefits you might see as an organization. Then, we'll move into building a modern data platform. What are the general components? What are the best practices? What does the process of building that look like? Then, what does modernization mean for your team? Lastly, we'll move into the migration aspect - why you would choose Google Cloud as a platform solution, how to choose a partner, and then working with GCP and Redapt.
So, let's dig into this modern data platform. What is a modern data platform? Well, a modern platform allows you to leverage the scalability of the cloud. By scalability, we're not just increasing a machine's processing power, memory, and storage; we're also instantly adding additional machines to the problem. We're responding to peaks and valleys of resource demand. We're adding compute power as needed, and we're removing compute power as demand decreases. A modern data platform also ensures proper security and governance are in place. Security is now a shared responsibility, but with powerful allies. These allies are assisting you with compliance regimes like HIPAA, GDPR, and PCI-DSS. They're also assisting with auditing - identifying who did what, when, and where.
The modern data platform also democratizes data access. With strong governance built in, you put data in the hands of the decision makers, people who can actually take action on the insights, and eliminate the impression of IT as gatekeepers making arbitrary decisions. And lastly, the modern data platform employs advanced analytics tools like artificial intelligence and machine learning. Suddenly we're able to see insights that we would rarely find sifting through the data at human scale.
As we move on to examples of modern data platforms, let's take a look at what a blueprint looks like. Well, the modern data platform is automated. With the speed and scale of data that is coming in today, we can't wait for a human to intervene to manually trigger an event. These workflows have to happen automatically. It also, in regards to automation, needs to be repeatable, reliable, and return the same result for the same set of problems.
It needs to be advanced analytics ready. Right now the speed of analytics can't be hampered by finding the right plug to put in the right socket. It also has to be flexible in responding to the change at the speed of new insight. Lastly, it has to be governed and secure. I can't stress enough that your data is one of your organization's most valuable assets, and there's a responsibility to protect that for the organization, but there's also responsibility to protect that for the users of that data. So, if I'm submitting data to your organization, you've got a responsibility to protect that data.
Let's look at a couple of examples. This is a sample build-out of a recommendation engine using Google Cloud Platform. You can see on the left-hand side, we've got data sources coming in. We've got inventory data, we've got purchase data, we've got wishlists, and we've got reviews. And that data is coming in from Cloud SQL. Cloud SQL is Google's managed database service that makes it easy to maintain, manage, and administer your own relational database. It supports MySQL, it supports PostgreSQL, and now it also supports SQL Server. And then, we also have Cloud Datastore, a managed, scalable NoSQL database focused on high availability and durability.
Those data sources are then feeding into our ETL process, where we're transforming and enriching the data. And here we're using Cloud Dataflow, which is a fully managed data processing service with automatic provisioning and autoscaling, and it's reliable and consistent.
Out of that transformation, that data gets stored in Google Cloud Storage, which is secure, durable, highly available, low-cost storage. That data then is accessible for our machine learning tools. In this example, we've got Cloud Dataproc, which is a fast, easy, and fully managed service for Hadoop and Spark. And then, we've got machine learning and prediction APIs. These are hosted solutions to run training jobs and predictions, all at scale.
In none of these do you need to worry about the underlying infrastructure. Lastly, we've got BigQuery in our analytics, and here we've got a serverless, highly scalable, cost-effective analytics data warehouse that can process petabytes of data in very little time, with zero operational overhead. I can't stress enough the value of BigQuery in analytics. This is one of the crowning pieces of the Google Cloud environment. Below that we've got our applications. This is our presentation layer to the customers. We've got shopping cart, browsing, and outreach, and all of these can be bundled within that platform, as well.
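To make that concrete, here's a minimal sketch of what that BigQuery analytics step can look like from the Python client. This isn't the code from the webinar demo; the project, dataset, and table names are placeholders.

```python
# Hypothetical example: top customers by purchase count, queried from BigQuery.
# Project, dataset, and table names are made up for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")  # placeholder project ID

sql = """
    SELECT customer_id, COUNT(*) AS purchases
    FROM `my-retail-project.sales.purchase_events`
    GROUP BY customer_id
    ORDER BY purchases DESC
    LIMIT 10
"""

# BigQuery runs the query serverlessly; the client just waits for and reads the rows.
for row in client.query(sql).result():
    print(row.customer_id, row.purchases)
```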
Here's another example of architecting on the platform. This example is using the internet of things and using sensor stream ingestion. On the left-hand side, we've got devices sending data into our gateway. That data is then fed into the ingestion portion of the platform. We've got Cloud Pub/Sub. Cloud Pub/Sub is a message-queuing solution - there's monitoring built in and there's logging built into the ingestion pipelines.
Cloud Pub/Sub is then the automated trigger starting the workflow in Cloud Dataflow. And Cloud Dataflow in this example is sending information to storage in Google Cloud Storage and Datastore, which we've already talked about. Cloud Bigtable is a massively scalable NoSQL database - again, a fully managed solution. And we've got our analytics: we've got Dataflow, BigQuery, and Dataproc, which we've touched on. Then we've got Datalab, which is Google Cloud's hosted Jupyter notebook environment. From here that flows into our application and presentation layers.
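Here's a minimal Apache Beam sketch of that Pub/Sub-to-Dataflow pattern. It's not the pipeline from the diagram (which also writes to Cloud Storage, Datastore, and Bigtable); for simplicity it reads JSON messages from a subscription and writes them to BigQuery. The subscription, table, and schema names are assumptions for illustration.

```python
# Hypothetical streaming pipeline: Pub/Sub -> parse JSON -> BigQuery.
# Running it on Dataflow would also require project, region, and staging options.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # Pub/Sub sources require streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-sub")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:iot.sensor_readings",
            schema="device_id:STRING,reading:FLOAT,ts:TIMESTAMP")
    )
```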
So what are the benefits that your organization might see? Well, an ability to optimize marketing initiatives. I've touched on Google Cloud's marketing solution. This is a 360-degree view of your customers - from sales data, from your customer relationship management database, from social media, from their ad clicks - and all of that data is brought into a solution where machine learning gives you actionable insights to find the lifetime value of your customers and target your marketing initiatives. It's a fantastic product.
You can also streamline your supply chains, using data-driven predictions on lead times for goods and better management of shipping logistics, forecasting of sales trends using predictive analytics, delivering and designing better products. All of these are things that you can see when you start leveraging this modern data platform.
So what does building that platform look like? What is it going to take for your organization to get into that place of success? Well, first let's poke at what the components are. So we've talked about an overview of what it looks like. But, when we dig into the real nuts and bolts, you're going to have a storage solution and that storage can be in the cloud fully, or it can be a hybrid solution, but storage is critical. You need to be able to identify where that data is going to go, how it's going to be stored, and the constraints you have on that data.
Those constraints may be why you have a hybrid solution. Because of governance, or maybe licensure, you might not be able to move the data to the cloud; it may need to remain partially on premises. And then, your platform's going to have an ETL or an ELT data system. We're using this to create reproducible results with automated orchestration. And your solution has to have open data access, which allows visualization tools and self-service business intelligence to reach all levels of the organization, and lets machine learning and AI pick up that data and automatically produce valuable insights.
You also are going to see virtual data consolidation in a modern data platform. This allows for data consumption, orchestration, and analysis without ELT or ETL of the original source. Basically you're bringing in a view of that data, sending it through your pipelines, and never manipulating that original source. It stays pristine.
Then it's going to have robust data indexing and security measures in place. We've talked about security, the value of your data to the organization, and the responsibility to maintain secure data. Indexing is going to give your organization a common language and improve data governance and security across the organization. Lastly, a modern data platform includes a data life cycle management solution. This means automated deprecation of versions of software. It means moving data into archive storage, or deleting data, based on rules that you create; the data life cycle management then takes over and operates.
If we take these components and stick them on top of Google Cloud's tools, we can see here, starting on the left-hand side, we've got capture. We've talked a little bit about data ingestion with Cloud Pub/Sub, and we've talked about the internet of things coming in; data is being streamed in. We also have a data transfer service, which handles cloud-to-cloud or bucket-to-bucket transfer of data, or from on premises to the cloud. And then there's also a storage transfer service.
This is for transferring large-scale data to GCP. Sometimes sending it over the wire isn't the best way to do this, so we have to implement different mechanisms for massive-scale data. Then we need to process that data. We have ELT or ETL processing. We've talked about Cloud Dataflow a little bit, and we've talked about Cloud Dataproc. Well, there's also a tool called Cloud Dataprep. This is a tool for visually exploring, cleaning, and preparing your data for analysis and machine learning. This is something to put in front of your non-technical staff. They can take a look at a CSV file or an Excel spreadsheet and identify the columns of interest: "Here's the way that this column needs to be manipulated." They don't need to learn programming to do this.
Then we go into the data lake and data warehousing. We've got Cloud Storage and we've got BigQuery storage, and I'm going to circle back to that in a second, after noting the fact that BigQuery also has an analysis engine. You'll see that analysis happens separately from storage in BigQuery; one of its strengths is that it separates the storage of data from the actual analysis of the data.
Google can increase the compute power as needed for a large-scale analysis. That's how it responds to these petabyte-scale requests. Lastly, we've got advanced analytics, where we can use Cloud AI services, and TensorFlow is a machine learning tool. And Google Data Studio is Google's data visualization tool. It's a web-based interface, very much along the lines of Tableau, Looker, or Power BI. Your users can connect to the data you have stored within your platform, visualize it, and gain insights. Most people aren't aware of the fact that Google Sheets can connect to a lot of the data solutions that are stored in Google Cloud.
If we look at that top row, there's a line beneath it with Cloud Data Fusion. Cloud Data Fusion is a mechanism for connecting to disparate sources of data. This is where that data virtualization piece comes in. There are 150-plus preconfigured connectors and transformations that can just be automatically plugged in - they're part of the package in Google Cloud. Then we've got Data Catalog underlying all of that. Data Catalog is our mechanism for managing metadata about our data resources, and Cloud Composer is the bottom layer of that. That's where we do orchestration.
So, let's take this and plop it into a big data event processing solution. Here on the left-hand side, we've got streaming input and we've got batch input. That streaming input comes into Cloud Pub/Sub, the messaging service. That messaging service says, "Hey, I have new information," and sends it on to Cloud Dataflow. Cloud Dataflow then processes that data and sends it into Bigtable for further analysis. The batch input goes over into the ETL system - again via Dataflow - feeding that data into Bigtable. From there we're feeding that data into the rest of our solution, where we've got analysis tools, reporting, and pushing out to our mobile devices.
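For the Bigtable side of that diagram, here's a hedged sketch of writing a single event row with the Cloud Bigtable Python client. The instance, table, column family, and row key design are placeholders, not part of the original example.

```python
# Hypothetical write of one sensor event into Cloud Bigtable.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("events-instance").table("events")

# Bigtable rows are keyed byte strings; prefixing the key with a device ID is a
# common design to spread writes and keep one device's readings together.
row = table.direct_row(b"device-42#2024-01-01T00:00:00Z")
row.set_cell("metrics", "temperature", b"21.5")
row.commit()
```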
When we talk about moving to a modern data platform, clearly data is the key. We have to have some best practices in mind when we're looking at what we're trying to accomplish: we have to have the right amount of storage, we have to know what our data sources are, the data has to be cleansed and optimized, and security and governance have to be in place as the data arrives and is processed.
Our challenge is, how do we do this? Well, at Redapt, we like to talk about the four Vs of your data. The first V is velocity. How fast does that data arrive? What is the cadence of my batches? How often can I expect updates of that data? The second V is volume. How much data am I receiving? That's going to directly impact the kind of storage that we need to set up. So how much data is it? Am I getting kilobytes of data at a time? Am I getting megabytes of data? Am I getting gigabytes of data?
The third V is variety. What kinds of data am I getting? Do I have a single source of truth in this data? Do I have multiple sources coming in? Do I have social media feeds merging with log files, merging with ad click data? Then there's the veracity of that data. How clean is it? What do I need to do in order to get value from that data? So we've got four Vs - velocity, volume, variety, and veracity - and that's going to have an effect on choosing the right solution.
So now that we know what our best practices are, what's the process of getting there? First is to identify and clearly understand your technical maturity. This is not a knock on your organization. It's about being honest, so that everybody's on the same page. A mature organization is agile, they're adaptive. They can rapidly scale up or down and shift operations. And, they're also innovative.
Now the last several months have probably done a really good job of testing your organization's maturity. If you didn't hiccup when all of a sudden your workers had to be at-home-workers, then you're probably a good, mature organization. If you were scrambling, putting solutions in place, trying to figure out how to do this, you probably have work to do there. This is not a knock, it means you just have an opportunity for change.
Once you identify your current capacity, then you can identify the goals of modernization. You have to make sure that there's agreement throughout the organization on what those goals are, that everybody's looking in the same direction, and that there's clarity of voice. Otherwise you get that hype cycle of, "Oh, we're so excited about this modern data platform," and then delays occur and things start falling apart. You start hearing words like "your system" and "my tool" instead of "ours."
The second step is data assessment: identifying what data you have, where the data is coming from, and any gaps in your data. So when I say, "What data do you have?" it's not just the data that you know about. It's the data that you don't know about, or are not paying attention to. It's estimated that 90% of the world's data has been created in the last two years.
There's data like log data. We all know about social media data, but there's also social media metadata. There's all sorts of data your organization has access to, but may not actually know about. I can speak to an example of tracking down gaps and where the data is coming from. At one point, I was working on a solution and the organization had terabytes of aggregated product data.
This data was aggregated at the weekly level and it was aggregated at the day level. The question came up, "Can we aggregate this at the hour?" But we also needed to add some enrichment to that data that was only in the raw files. And so my first question was, "Well, where's the raw data?" I had to talk to four different people within the organization before someone actually even knew what I was referring to by raw data. They kept pointing me to, "Well, here's the data. This is the data that we're using." Once we found the raw data, it was in archive storage, and we had to rerun significant processes just to get access to that data. So, you need to know where that data is. You need to know where the gaps are in the data you actually have.
Step three is looking at cloud adoption. This is deciding which workflows belong in the cloud and deciding whether you're going to be a fully cloud solution, or whether you're going to be a hybrid solution. We touched a little bit earlier on why it might be a hybrid solution. You might have software that's not conducive to moving to the cloud, either because the resources can't be reproduced, or because you have licensing constraints that prevent putting it in the cloud. You also might, again, have governance restrictions on moving some of your data to the cloud. So, you would end up being a hybrid.
Once you identify the workloads that are suitable for the cloud, then we can identify which cloud provider or providers to partner with. You may find that one cloud provider does something better than another one does. In that case, you might have multiple providers. Every project that I have worked on at Redapt so far has been a multi-cloud solution.
Once we've identified the maturity of the organization, what data you have, and what workflows are appropriate for the cloud, then we can start looking at what advanced tools we can implement for predictive analytics, artificial intelligence, and machine learning. Some examples of artificial intelligence that Google offers are: there's visual AI and there's sentiment analysis. Sentiment analysis is looking at your chatbot and determining whether or not the customer that was chatting on the chatbot was annoyed. There's translation AI, and translation AI is pretty incredible. Now it can actually take an image that's in a foreign language and output a translation.
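As a small illustration of the sentiment analysis idea, here's a sketch using the Cloud Natural Language API from Python. The chat message is invented; a real chatbot integration would feed in the actual conversation text.

```python
# Hypothetical example: score the sentiment of one chat message.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I have been waiting 40 minutes and still have no answer.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Score runs from -1.0 (negative) to 1.0 (positive); magnitude reflects strength.
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)
```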
Lastly, in building out the platform, what does modernization mean for your team? Modernization is a shift, and there's no question that organizational change can be difficult. For IT, one of the largest challenges is understanding that their role is no longer managing hardware and software, but shifting to governance and visioning: thinking about what the possibilities are, rather than what the limitations are. IT teams that are aligned with the business become far more valuable to the organization, and the business recognizes that IT is working in partnership with them. The questions can become, "What do we need? What can we do?" instead of, "We don't have the resources for that," or, "There's no way that I can set that up in time." And now we're getting to the migrating to a modern data platform portion of the presentation.
And, I'm here to talk about Google Cloud and its platform solutions. I'm sold on cloud. That's why I've spent the time that I have in becoming knowledgeable. One of the benefits of the Google Cloud platform, without question, is their leadership in Kubernetes. They are the original developers of Kubernetes. By far they have the best cloud implementation of Kubernetes and cloud containers.
Google also has some very innovative pricing. When it comes to virtual machines, their virtual machines are highly customizable. You can customize CPU, RAM, disk, type of disk, and GPU independently of each other. Yes, there are standard types that exist. You can point and click and say, "Just set me up with this type of a virtual environment." But, customization is simple and easy to implement. This stands in opposition to other cloud solutions, where, if I need to increase my CPU, I have to increase my RAM.
If I increase my RAM, I have to increase my CPU, or, if I choose a specific disk type, I'm boxed in and have to pay specific pricing. Google is also very innovative in their billing, with per-second billing. One of the objectives here, with our cloud workflows, is that we will bounce up a cluster, run the workflow, and tear down the cluster. You pay only for the time that workflow is actually in place and operating. It's not by the hour. It's not by the quarter hour. It's not by the minute. It's down to the second.
Google also has automatic discounts on long-running workloads. This is not something where you have to call up Google and say, "Hey, I've got this long-running cluster. I've got a cluster that's up 24/7. It's just there doing its job because there's so much data we have to process." The Google billing system recognizes that cluster, recognizes that it's on, recognizes that it's working, and automatically begins to discount the cost of that workload.
Google Cloud also offers custom image types for creating instances for specific needs. Yes, there are images for SQL Server, and there's an image for an Ubuntu server, and what have you, but the ability to create a custom image for your specific use case is really important, especially as you start developing tools that aren't necessarily standard packaged tools. So, you can create your Ubuntu image that has X, Y, and Z built into it and you don't have to rebuild that image. You don't have to rebuild the VM every time you start one up. You just point at that image and up it comes, ready to go.
As we've talked about, lifecycle management is critical to a modern data platform. Google has the obvious things like auto-deletion or auto-deprecation. But one of the things that is really unique about Google's lifecycle management is changing the storage class of an object. By storage class, I mean that Google has different pricing depending on how regularly you access specific data.
If this is data that I access on a daily basis, you'd want to have it in one specific class, and you pay more for that regular access. Then, if you only access that data once a quarter, you can put it in a different class, and you pay less for that. Then there's cold storage, and then there's archival storage.
Those storage classes exist in other cloud solutions. What's unique about Google's implementation of lifecycle management is that you can change the storage class of an object right within the same bucket - which is roughly equivalent to a hard drive - as all the other files you're operating with. You don't have to move that file to a different storage bucket; within that same bucket, you can change the storage class.
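Here's a hedged sketch of both behaviors with the Cloud Storage Python client: a lifecycle rule that moves objects to a colder class after 90 days, and an explicit storage-class change on a single object in place. The bucket name, object path, and 90-day threshold are assumptions for illustration.

```python
# Hypothetical lifecycle and storage-class example for a Cloud Storage bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-analytics-bucket")  # placeholder bucket name

# Rule: objects older than 90 days drop to Coldline automatically.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()

# Or change one object's class right where it sits, without moving it to another bucket.
blob = bucket.blob("raw/2023/clickstream.avro")
blob.update_storage_class("NEARLINE")
```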
Lastly, and I think this, to me, is one of the underappreciated values that the Google Cloud environment brings, is the user-centric interface. The GUI itself is very intuitive, but the command line interface is extraordinarily intuitive as well, as is the interplay between all of the resources that exist in Google Cloud. My first project when coming to Redapt was a data migration project from AWS to Google Cloud. It seemed like a pretty straightforward problem, but it turned out to be a little more of a headache than it looked on the surface. I needed to go into the AWS environment and create a compute instance that would allow me to do specific things.
I was mind boggled by the hurdles and the barriers that were in place and the lack of intuitive nature of the AWS environment. When I was so used to the Google Cloud environment, where I bounced up an instance, it automatically was loaded with network connections. I could go into that network connection. I could set my firewall rules and, tada, I was finished. That was not my experience in AWS.
So here we are: you've decided you're going to the cloud, you've decided on your cloud solution. So how do you find the right partner? You want a modernization partner who can provide you with a cloud assessment, and that cloud assessment will determine the capacity and technology best suited for your organization. They're also going to provide you a cloud adoption framework, so you can assess for yourself which services are most beneficial to your organization.
They're going to assist in navigating cloud migration challenges, so the transition to the public cloud won't lead to disruptions or downtime, and they're going to help you with implementation of best practices to address security, compliance, and governance. These are all core practices at Redapt. In addition, we have teams dedicated to application modernization, modern data center, advanced analytics, and emergent technology.
In conclusion, when you look at a modern data platform, it's going to leverage the scale of the cloud, provide proper security and governance, democratize your data access, and allow for advanced analytics tools like AI and machine learning, putting actual data insights in the hands of the decision-makers. Benefits include the ability to optimize marketing initiatives, streamlining of supply chains, better management of shipping logistics, forecasting of sales trends, and design and delivery of better products.
Lastly, just a little plug for Redapt: We're a premier partner for business transformation, serving thousands of clients and migrating millions of users to the cloud. Our capacities span the depth and breadth of today's IT, from consulting to world-class support. No matter where you are in your data migration, we have the experience and deep expertise you need to meet your objectives and realize the best return on your investment. We'd love to get in touch with you.
Speaker 1: 36:01 Awesome. Thank you, Chris. I did get a couple of questions sent directly to me. I think we have a little time to cover those, so I'll just kind of fire off and, if it makes sense, just answer as best you can. First one is about outsourcing: outsourcing modernization is a little bit scary. I can empathize with that, because we sell consulting services and engineering, but I'm also a consumer of consulting services. So how does Redapt help? Obviously we're helping with the expertise and the implementation, but how do we help with knowledge transfer?
Christof von Ra...: 36:54 That's great. One of the things is, we can offer workshops where we come on site and you gather the people that you need to learn this. We can use a train-the-trainer sort of approach, where you bring the interested parties, or the heads of those interested parties, together. We walk through what's happening and identify... So, for example, we've got a BigQuery optimization workshop that we can deliver. Organizations decide they're all in. They start using BigQuery and they start seeing large BigQuery bills.
The first thing they come and ask Redapt is, "Geez, how do I reduce my BigQuery bill?" We can come in and walk through the best practices for running a BigQuery request, and how to reduce those costs.
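One practice from that kind of workshop is checking how many bytes a query will scan before you run it. Here's a minimal dry-run sketch with the BigQuery Python client; the query and table names are placeholders.

```python
# Hypothetical cost check: a dry run reports bytes scanned without running the query.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT order_id, total FROM `my-project.sales.orders` "
    "WHERE order_date >= '2024-01-01'",
    job_config=config,
)
print(f"This query would scan {job.total_bytes_processed / 1e9:.2f} GB")
```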
We can also come in and give a workshop on the various tools and their uses. The other thing that our support systems offer is: we are there. We're there for you as an organization. When I provide support to someone, I'd much rather teach a person how to fish than give them a fish, because I've got plenty on my plate. If I teach someone within their organization how to fish, they're not going to come back with that same question, and neither will someone else within the organization.
Speaker 1: 38:26 Yeah. Cool. Well, that's great. And I know as an organization we strongly believe that high-performing companies kind of develop this expertise, and we think it's our role just to accelerate that process. Like you said, we've got enough projects to do that. We want to enable our customers to be successful on their own. Here's another one. So at a high level how do you balance data democratization with governance?
Christof von Ra...: 39:03 That's the million dollar question. First of all, you have to start with: are there constraints, are there governance constraints? So, is there information that's PII, that we can't let people have access to? Or, if we have that data, who has access to it? And, that's one of the benefits of having a solid data governance system in place. We can identify that this CFO and this CTO can have access to that data. No one else on their team can.
There are also obfuscation layers. With Data Catalog you can identify data as being email-type data, for example, and blank out the email address. That blanked-out email address might only be presented to a certain subset of users; another subset of users might see the full email address. That's one of the important reasons for having a solution like Data Catalog: you can identify the data and who should or should not have access to specific types of data. Then, it's fairly simple to put constraints on the visibility of that data.
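As a simplified sketch of that obfuscation idea, here's a masked view created through the BigQuery Python client. Data Catalog policy tags are the fuller mechanism described above; this only shows the masking concept, and the project, dataset, and column names are made up.

```python
# Hypothetical masked view: general analysts see a blanked-out email column.
from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.reporting.customers_masked")
view.view_query = """
    SELECT
      customer_id,
      '*****' AS email,   -- masked for general analysts
      lifetime_value
    FROM `my-project.crm.customers`
"""
client.create_table(view, exists_ok=True)
```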
Speaker 1: 40:29 Okay. There's one more here. I think we've still got a little bit of time. So, "We're considering moving from one cloud platform to GCP. What's the level of effort in moving a data lake?" You don't need to get too deep, because we don't have all the details, but at a high level how would we approach that? And is that small, medium, large? 30 days, six months? What kind of effort is that?
Christof von Ra...: 40:58 It's a classic case of "it depends." For example, if the data lake is primarily in MySQL someplace, it's fairly simple. Google Cloud's Cloud SQL interface has a migration tool, and badabing badaboom, you just point and click and in it comes. You set replication up, and the day you decide you're going to make Cloud SQL your master, you turn it into the master and you're off and running.
But, I think one of the challenges that we run into most of the time with this is, it's fairly simple most of the time to do a lift and shift from a current environment to Google Cloud. What is not simple is to take that external solution and leverage the strength of what Google Cloud offers. So pound for pound, it's not very difficult. It depends on the size of the project, but it could be six weeks. It could be a three or six month process to get everything over and running.
But, if you really want to leverage Google Cloud and the managed services that it provides, we might end up looking at restructuring your entire system. A good example of this is one of the projects I worked on: an orchestration transition to Google Cloud. Their orchestration was hand-built. It was built in Python and it was thousands and thousands of lines of code and multiple data streams.
When we took a look at it to put it into Composer, and actually stepped back and looked at the design pattern of what they were trying to accomplish, I was able to reduce those thousands of lines of code to less than 500. We ended up with only six workflow streams. The efficiency of that was tremendous. It took a lot of effort, a lot of lift, and a lot of coordination to make sure we were getting out what was being put into their original solution. But, in the end, they had a much more robust solution.
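For a sense of what those workflow streams look like once they're in Composer, here's a hedged sketch of a small Airflow DAG. The task bodies, names, and schedule are placeholders, not the customer's actual pipeline.

```python
# Hypothetical Composer (Airflow) DAG: two dependent tasks on a daily schedule.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw files from the source bucket")


def transform():
    print("enrich and reshape the data into the warehouse schema")


with DAG(
    dag_id="daily_enrichment",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The >> dependency replaces pages of hand-written sequencing logic.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```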
Speaker 1: 43:27 Awesome. Awesome. Well, thank you so much for presenting. I love how you added a lot of color and details to the bullet points. Clearly you know what you're doing and I hope our audience walks away feeling like that, too. So, I appreciate it.
Christof von Ra...: 43:47 Yeah. Thank you for the time.
Speaker 1: 43:49 All right. Thanks for attending everybody.