In this webinar, Redapt Google Cloud Architect, Christof von Rabenau, discusses how your enterprise can build a modern data platform in order to leverage advanced analytics capabilities.
Video transcription:
Let me introduce our presenter. It's Christof. He's had about 30 years of experience in data and analytics. He has all the GCP specializations and, obviously, a lot of experience prior to GCP, and he works on quite a few of our most critical accounts, helping them mature how they handle data and prepare it for analytics. He's really good at going through that, finding insights for our customers, and producing a platform that can generate value for years. I'll turn it over to Christof.
Thank you. We're going to start with the questions you might have. Why are you here on this webinar? Maybe you're here to learn what makes a modern data platform, to understand how it will help you gain new insights from your data, and how it's different from something you already have in place. Maybe you're here to learn how to build that modern data platform, what solutions Google Cloud offers to implement it, and what skills are needed in your organization to make that happen. Or maybe you're here to learn about migrating your enterprise to that modern data platform: the things you need to consider for a successful migration, the best practices to keep in mind, and how to choose a partner to help with the implementation. We're going to dig into all of that.
Our agenda today is to have an introduction, talk about what this modern data platform is, look at some examples and blueprints, and then identify some of the key areas of benefit that you might see as an organization. Then we'll move into building a modern data platform: what are the general components, what are the best practices, what does the process of building it look like, and what does modernization mean for your team? Lastly, we'll move into the migration aspect: why you would choose Google Cloud as a platform solution, how to choose a partner, and working with GCP and Redapt.
Let's dig into this modern data platform. What is a modern data platform? Well, a modern data platform allows you to leverage the scalability of the Cloud. By scalability, we mean not just increasing a single machine's processing power, memory, and storage, but also instantly adding additional machines to the problem. We're responding to peaks and valleys of resource demand, adding compute power as needed and removing it as demand decreases. A modern data platform also ensures proper security and governance are in place. Security is now a shared responsibility, but one with powerful allies: these allies assist you with compliance regimes like HIPAA, GDPR, and PCI DSS. They also assist with auditing, identifying who did what, when, and where.
A modern data platform also democratizes data access. With strong governance built in, you put data in the hands of the decision makers, the people who can actually take action on the insights, and eliminate the impression of IT as gatekeepers making arbitrary decisions. Lastly, a modern data platform employs advanced analytics tools like artificial intelligence and machine learning. Suddenly we're able to see insights that we would rarely find sifting through the data at human scale.
As we move on to examples of modern data platforms, let's take a look at what a blueprint looks like. Well, a modern data platform is automated. With the speed and scale of data coming in today, we can't wait for a human to intervene and manually trigger an event; these workflows have to happen automatically. In regard to automation, it also needs to be repeatable and reliable, returning the same result for the same set of inputs. It needs to be advanced-analytics ready. The speed of analytics can't be hampered by finding the right plug to put in the right socket. It also has to be flexible, responding to change at the speed of new insight. Lastly, it has to be governed and secure. I can't stress enough that your data is one of your organization's most valuable assets, and there's a responsibility to protect it for the organization, but there's also a responsibility to protect it for the users of that data. If I'm submitting data to your organization, you've got a responsibility to protect that data.
Let's poke at a couple of examples. Here is a sample build-out of a recommendation engine using the Google Cloud Platform. You can see on the left-hand side, we've got data sources coming in: inventory data, purchase data, wishlists, and reviews. That data is coming in from Cloud SQL. Cloud SQL is Google's managed database service that makes it easy to maintain, manage, and administer your own relational database. It supports MySQL, it supports PostgreSQL, and now it also supports SQL Server.
We also have Cloud Datastore, a managed, scalable NoSQL database focused on high availability and durability. Those data sources then feed into our ETL process, where we're transforming and enriching the data. Here, we're using Cloud Dataflow, which is a fully managed data processing service: automatically provisioning, auto-scaling, reliable, and consistent. Out of that transformation, the data gets stored in Google Cloud Storage, which is secure, durable, highly available, low-cost storage. That data is then accessible to our machine learning tools. In this example, we've got Cloud Dataproc, which is a fast, easy, fully managed service for Hadoop and Spark. We've got machine learning and prediction APIs. These are hosted solutions to run training jobs and predictions, all at scale. You don't need to worry about the underlying infrastructure.
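To make that transform-and-enrich step concrete, here is a minimal Apache Beam sketch in Python (Beam is the SDK that Cloud Dataflow runs). The project ID, bucket paths, and field names are hypothetical stand-ins, not part of the architecture above; it assumes the purchase records have been exported as newline-delimited JSON.

```python
# Minimal Apache Beam sketch of a Dataflow transform-and-enrich step.
# All names here (project, buckets, fields) are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def enrich(line):
    """Flag high-value purchases (illustrative enrichment only)."""
    row = json.loads(line)
    row["high_value"] = row.get("purchase_total", 0) > 100
    return json.dumps(row)


options = PipelineOptions(
    runner="DataflowRunner",      # use "DirectRunner" to test locally
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/purchases.json")
        | "Enrich" >> beam.Map(enrich)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/enriched/purchases")
    )
```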
Lastly, we've got BigQuery in our analytics. Here, we've got serverless, highly scalable, cost-effective analytics storage that can process petabytes of data in a small amount of time with zero operational overhead. I can't stress enough the value of BigQuery in analytics. This is one of the crowning pieces of the Google Cloud environment. Below that we've got our applications. This is our presentation layer to the customers. We've got shopping cart, browsing, and outreach, and all of these can be bundled within that platform as well. Here's another example of architecting on the platform. This example uses the Internet of Things and sensor stream ingestion. On the left-hand side, we've got devices sending data in to our gateway. That data is then fed into the ingestion portion of the platform.
We've got Cloud Pub/Sub. Cloud Pub/Sub is a message queuing solution, with monitoring and logging built into these ingestion pipelines. Cloud Pub/Sub is then the automated trigger that starts the workflow in Cloud Dataflow, and Cloud Dataflow in this example is sending information to storage, in Google Cloud Storage and Datastore, which we've already talked about. Cloud Bigtable is a massive NoSQL data solution. Again, a fully managed solution. And we've got our analytics: Dataflow, BigQuery, and Dataproc, which we've touched on, and then Datalab, which is Google Cloud's implementation of Jupyter notebooks. Again, from here that flows into our application and presentation layers.
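As a small illustration of that ingestion step, here is a sketch of publishing one sensor reading to Cloud Pub/Sub with the google-cloud-pubsub client library; the project and topic names are hypothetical.

```python
# Minimal sketch: publish one device reading to a Cloud Pub/Sub topic.
# Project and topic names are hypothetical placeholders.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "sensor-readings")

reading = {"device_id": "sensor-42", "temperature_c": 21.7}
future = publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
print("Published message ID:", future.result())
```

From there, a subscription on that topic is what would trigger the downstream Dataflow workflow.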
What are some of the benefits your organization might see? One is the ability to optimize marketing initiatives. I've touched on the Google Cloud for Marketing solution. This is a 360-degree view of your customers: from sales data, from your customer relationship database, from social media, from their ad clicks. All of that data is brought into a solution where machine learning gives you actionable insights to find the lifetime value of your customers and target your marketing initiatives. It's a fantastic product. You can also streamline your supply chains using data-driven predictions on lead times for goods and better management of shipping logistics, forecast sales trends using predictive analytics, and deliver and design better products. All of these are things that you can see when you start leveraging this modern data platform.
What does building that platform look like? What is it going to take for your organization to get to that place of success? Well, first let's poke at what the components are. We've talked about an overview of what it looks like, but when we dig into the real nuts and bolts, you're going to have a storage solution, and that storage can be fully in the Cloud or it can be a hybrid solution, but storage is critical. You need to be able to identify where that data is going to go, how it's going to be stored, and the constraints that you have on that data. Those constraints may be why you have a hybrid solution: because of governance or perhaps licensing, you may not be able to move the data to the Cloud. It may need to remain partially on premises.
Your platform is going to have an ETL or ELT data system. We're using this to create reproducible results with automated orchestration. The solution has to have open data access, which allows visualization tools and self-service business intelligence to reach all levels of the organization, and lets machine learning and AI pick up that data and automatically produce valuable insights. You're also going to see virtual data consolidation in a modern data platform. This allows for data consumption, orchestration, and analysis without ELT or ETL of the original source. Basically, you're bringing in a view of that data, sending it through your pipelines, and never manipulating the original source. It stays pristine.
It's going to have robust data indexing and security measures in place. We've talked about security, the value of your data, and the responsibility to keep it secure. Indexing is going to give your organization a common language and improve data governance and security across the organization. Lastly, a modern data platform includes a data lifecycle management solution. This means automated deprecation of versions of software. It means moving data into archive or deleting data based on rules that you create; data lifecycle management then takes over and operates.
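As a concrete sketch of that kind of rule-based lifecycle management, here is what setting bucket-level rules might look like with the google-cloud-storage Python client; the bucket name and age thresholds are hypothetical and would come from your own retention policy.

```python
# Hedged sketch: rule-based data lifecycle management on a Cloud Storage
# bucket. Bucket name and thresholds are hypothetical placeholders.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
bucket = client.get_bucket("my-data-bucket")

# Shift objects to cheaper Nearline storage after 30 days...
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
# ...and delete them entirely after five years.
bucket.add_lifecycle_delete_rule(age=365 * 5)

bucket.patch()  # persist the updated lifecycle configuration
```

Once the rules are saved, Cloud Storage applies them automatically; no scheduled job or human intervention is needed.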
If we take these components and put them on top of Google Cloud's tools, we can see, starting on the left-hand side, we've got capture, and we've talked a little bit about data ingestion with Cloud Pub/Sub: the Internet of Things coming in, data being streamed in. We also have a data transfer service, which is Cloud-to-Cloud or bucket-to-bucket transfer of data, or from on-premises to the Cloud. There's also a storage transfer service. This is for transferring large-scale data to GCP. Sometimes sending it over the wire isn't the best way to do this; we have to implement different mechanisms for data at massive scale. Then we need to process that data, with ELT or ETL processing.
We've talked about Cloud Dataflow a little bit, and we've talked about Cloud Dataproc. Well, there's also a tool called Cloud Dataprep. This is a tool for visually exploring, cleaning, and preparing your data for analysis and machine learning. This is something to put in front of your nontechnical staff. They can take a look at a CSV file or an Excel spreadsheet and identify: here are the columns of interest, here's the way this column needs to be manipulated. They don't need to learn programming to do this.
Then we go into the data lake and data warehousing. We've got Cloud Storage and we've got BigQuery storage. I'm going to circle back to that in a second, after I touch on the fact that BigQuery has an analysis engine; you'll see the analysis happens separately from storage. One of BigQuery's strengths is that it separates storage of data from the actual analysis of the data. Google can increase the compute power as needed for a large-scale analysis, and that's how it responds to these petabyte requests. Lastly, we've got use: advanced analytics, where we can use Cloud AI services and TensorFlow, a machine learning tool. Google Data Studio is Google's implementation of data visualization. It's a web-based interface, very much along the lines of Tableau, Looker, or Power BI, where your users can connect to the data you have stored within your platform, visualize it, and gain insights. Most people aren't aware of the fact that Google Sheets can connect to a lot of the data solutions stored in Google Cloud.
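To show how little operational overhead that analysis layer carries, here is a minimal sketch of running an ad hoc query with the google-cloud-bigquery client; the dataset and table names are hypothetical.

```python
# Minimal sketch: run an ad hoc BigQuery query and print the results.
# The project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT product_id, SUM(quantity) AS units_sold
    FROM `my-gcp-project.sales.orders`
    GROUP BY product_id
    ORDER BY units_sold DESC
    LIMIT 10
"""

# BigQuery provisions the compute for this job itself; there is no
# cluster to create or tear down.
for row in client.query(query).result():
    print(row.product_id, row.units_sold)
```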
If we look at that top row, then we've got a line beneath that with Cloud Data Fusion. Cloud Data Fusion is a mechanism for connecting to disparate sources of data. This is where that data virtualization piece comes in. There are 150-plus preconfigured connections and transformations that can just automatically be plugged in; they're part of the package of Google Cloud. Then we've got Data Catalog underlying all of that. Data Catalog is our mechanism for managing information and identifying our resources. Cloud Composer is the bottom layer, and that's where we do orchestration, as sketched below.
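Cloud Composer is managed Apache Airflow, so orchestration is expressed as Python DAGs. Here is a hedged sketch of a two-step daily workflow; the task bodies, names, and schedule are placeholders, not part of the webinar's architecture.

```python
# Hedged sketch of a Cloud Composer (Apache Airflow) DAG: extract, then
# load, once a day. Task logic and names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("Pull new files from the landing bucket")  # placeholder step


def load():
    print("Load cleaned files into the warehouse")  # placeholder step


with DAG(
    dag_id="daily_data_platform_run",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```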
Let's take this and plop it into a big data event processing solution. Here on the left-hand side, we've got streaming input and we've got batch input. That streaming input comes into Cloud Pub/Sub, which is the messaging service. That messaging service says, "Hey, I have new information," and sends it on to Cloud Dataflow. Cloud Dataflow then processes that data and sends it into Bigtable for further analysis. The batch processing brings its data over into the ETL system, again via Dataflow, feeding that data into Bigtable. From there, we're feeding that data into the rest of our solution, where we've got analysis tools and reporting, and we're pushing out to our mobile devices.
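Here is what the streaming half of that flow can look like as a Beam pipeline. For simplicity this sketch writes to BigQuery rather than Bigtable, and the topic, table, and schema are hypothetical placeholders.

```python
# Hedged sketch of the streaming path: Pub/Sub in, Dataflow (Beam)
# transform, warehouse write. This variant writes to BigQuery instead of
# Bigtable for brevity; all resource names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/sensor-readings")
        | "Parse" >> beam.Map(json.loads)  # each message is a JSON record
        | "Write" >> beam.io.WriteToBigQuery(
            "my-gcp-project:telemetry.events",
            schema="device_id:STRING,temperature_c:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```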
When we talk about moving to a modern data platform, clearly data is the key. We have to have some best practices in mind when we're looking at what we're trying to accomplish. We have to have the right amount of storage. We have to know what our data sources are. The data has to be cleansed and optimized, and security and governance have to be in place as the data arrives and is processed appropriately. Our challenge is: how do we do this? Well, at Redapt, we talk about the four V's of your data. The first V is velocity. How fast is that data arriving? What is the cadence of my batches? How often can I expect updates of that data?
The second V is volume. How much data am I receiving? That's going to directly impact the kind of storage we need to set up. Am I getting kilobytes of data at a time? Megabytes? Gigabytes? The third V is variety. What kinds of data am I getting? Do I have a single source of truth in this data? Do I have multiple sources coming in? Do I have social media feeds merging with log files, merging with ad click data? The last V is the veracity of that data. How clean is it? What do I need to do in order to get value from that data? So we've got four V's: velocity, volume, variety, and veracity, and they're going to have an effect on choosing the right solution.
Now that we know what our best practices are, what's the process of getting there? First is to identify and clearly understand your technical maturity. This is not a knock on your organization; it's about being honest, so that everybody's on the same page. A mature organization is agile and adaptive. They can rapidly scale up or down and shift operations. They're also innovative. Now, these last several months have probably done a really good job of testing your organization's maturity. If you didn't hiccup when all of a sudden your workers had to work from home, then you're probably a mature organization. If you were scrambling, putting solutions into place, trying to figure out how to do this, you probably have work to do there. Again, this is not a knock. It means you have an opportunity for change.
Once you identify your current capacity, then you can identify the goals of modernization, and you have to make sure there's agreement throughout the organization on what those goals are and that everybody's looking in the same direction. Without that clarity of voice, you get the hype cycle of "Oh, we're so excited about this modern data platform," and then delays occur and things start falling apart. Then you start hearing words like "your system" and "my tool" instead of "ours."
The second step is data assessment. You have to identify what data you have, where the data is coming from, and any gaps in your data. When I say "what data do you have," it's not just the data that you know about; it's the data that you don't know about or aren't paying attention to. It's estimated that 90% of the world's data has been created in the last two years. There's data like log data. We all know about social media data, but there's also data embedded in social media. There's all sorts of data your organization has access to but may not actually know about. I can speak to an example of finding the gaps and where the data's coming from. At one point I was working on a solution where the organization had terabytes of aggregated product data. This data was aggregated at the weekly level, and it was aggregated at the day level.
The question came up: can we not only aggregate this at the hour, but also add some enrichment to the data that was only in the raw files? My first question was, "Well, where's the raw data?" I had to talk to four different people within the organization before someone even knew what I was referring to by the raw data. They kept pointing to, "Well, here's the data. This is the data that we're using." Once we found the raw data, it was in archive storage, and we had to rerun significant processes just to get access to it. You need to know where that data is, and you need to know where the gaps are in the data that you actually have.
Step three is looking at Cloud adoption. This is deciding what workflows belong in the Cloud, and deciding whether you're going to be a fully Cloud solution or a hybrid solution. We talked a little earlier about why it might be a hybrid solution. You might have software that's not conducive to moving to the Cloud, either because the resources can't be reproduced or because you have licensing constraints on putting it in the Cloud. You also might have governance restrictions on moving some of your data to the Cloud. You would end up being hybrid. Once you identify the workloads that are suitable for the Cloud, then we can identify which Cloud provider or providers to partner with. You may find that one Cloud provider does something better than another one does. In that case, you might have multiple providers. I can say that every project I have worked on at Redapt so far has been a multi-cloud solution.
Once we've identified the maturity of the organization, what data you have, and what workflows are appropriate for the Cloud, then we can start looking at what advanced tools we can implement for predictive analytics, artificial intelligence, and machine learning. Some examples of artificial intelligence that Google offers are Vision AI and Sentiment Analysis. Sentiment Analysis is looking at your chatbot and determining whether or not the customer chatting with it was annoyed. There's also Translation AI, which is pretty incredible; it can now actually take an image that's in a foreign language and output a translation.
Lastly, in building out the platform, what does modernization mean for your team? Modernization is a shift, and there's no question that organizational change can be difficult. For IT, one of the largest challenges is understanding that their role is no longer managing hardware and software but shifting to governance and visioning: thinking about the possibilities rather than the limitations. IT teams that are aligned with the business become far more valuable to the organization. When the business recognizes that IT is working in partnership with them, the questions become "What do we need? What can we do?" instead of "We don't have the resources for that," or "There's no way that I can set that up in time."
Now we get to the migrating-to-a-modern-data-platform portion of the presentation. I'm here to talk about Google Cloud and its platform solutions. I'm sold on Google Cloud; that's why I've spent the time I have in becoming knowledgeable. One of the benefits of the Google Cloud Platform, without question, is their leadership in Kubernetes. They are the original developers of Kubernetes and by far have the best Cloud implementation of Kubernetes and Cloud containers. Google also has some very innovative pricing when it comes to virtual machines. The virtual machines are highly customizable: you can customize CPU, RAM, disk, disk type, and GPU independently of each other. Yes, there are standard types that exist; you can point and click and say, "Just set me up with this type of virtual environment," but customization is simple and easy to implement.
This stands in opposition to other Cloud solutions where, if I need to increase my CPU, I have to increase my RAM, and if I increase my RAM, I have to increase my CPU. Or if I choose a specific disk type, I'm boxed in and have to pay specific pricing. Google is also very innovative with their per-second billing. One of the objectives with our Cloud workflows is that we bounce up a cluster, run the workflow, and tear down the cluster. You pay only for the time that the workflow is actually in place and operating. It's not by the hour, not by the quarter hour, not by the minute. It's down to the second.
Google also has automatic discounts on long-running workloads. This is not something where you have to call up Google and say, "Hey, I've got this long-running cluster. I've got a cluster that's up 24/7. It's just there doing its job, because there's so much data we have to process." The Google billing system recognizes that the cluster is up, that it's on, and that it's working, and it will automatically begin to discount the cost of that workload.
Google Cloud also offers custom image types for creating instances for specific needs. Yes, there are images for SQL Server and images for an Ubuntu server and what have you, but the ability to create a custom image type for your specific use case is really important, especially as you start developing tools that aren't necessarily normal packaged tools. You can create your Ubuntu image that has X, Y, and Z built into it. You don't have to rebuild that image, and you don't have to rebuild the VM every time you start one up; you just point at that image and up it comes, ready to go.
As we've talked about, lifecycle management is critical to a modern data platform. Google has the obvious things like auto-deletion and auto-deprecation. But one of the things that is really unique about Google's lifecycle management is changing the storage class of an object. By storage class, I mean that Google has different pricing depending on how regularly you access specific data. If this is data you access on a daily basis, you'd want it in one specific class, and you pay more for that regular access. If you only access that data once a quarter, you can put it in a different class and pay less. Then there's cold storage, and then there's archival. Those classes exist in other Cloud solutions. What's unique about Google's implementation of lifecycle management is that you can change the storage class of an object right within the same bucket (a bucket being roughly equivalent to a hard drive) alongside all the other files you're operating with. Elsewhere, you'd have to move that file to a different storage bucket; with Google, within that same bucket, you can change the storage class.
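As a minimal illustration of that in-place change, here is what updating one object's storage class looks like with the google-cloud-storage client; the bucket and object names are hypothetical.

```python
# Minimal sketch: change one object's storage class in place, within the
# same bucket. Bucket and object names are hypothetical placeholders.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("reports/q1-2020.csv")

# Rewrites the existing object under the cheaper Coldline class; the
# object keeps its name and stays in the same bucket.
blob.update_storage_class("COLDLINE")
```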
Lastly, and to me this is one of the underappreciated values that the Google Cloud environment brings: the user-centric interface. The GUI itself is very intuitive, but so is the command-line interface, as well as the interfaces between all of the resources that exist in Google Cloud. My first project coming on to Redapt was a data migration from AWS to Google Cloud. It seemed like a pretty straightforward problem, but it turned out to be a little more of a headache than it looked on the surface. I needed to go into the AWS environment and create a compute instance that would allow me to do specific things. I was boggled by the hurdles and barriers that were in place and the lack of intuitiveness of the AWS environment, when I was so used to the Google Cloud environment, where I bounce up an instance, it automatically comes loaded with network connections, I can go into that network connection, I can set my firewall rules, and ta-da, I'm finished. That was not my experience in AWS.
Here we are: you've decided you're going to the Cloud, and you've decided on your Cloud solution. How do you find the right partner? Well, a modernization partner should provide you with a Cloud Assessment, and that Cloud Assessment will determine the capacity and technology best suited for your organization. They're also going to provide you a Cloud Adoption Framework, so you can assess for yourself what services are most beneficial to your organization. They're going to assist in navigating Cloud migration challenges, so the transition to the public Cloud won't lead to disruptions or downtime. They're going to help you with implementation of best practices to address security, compliance, and governance. These are all core practices at Redapt. In addition, we have teams dedicated to application modernization, the modern data center, advanced analytics, and emergent technology.
In conclusion, a modern data platform is going to leverage the scale of the Cloud, provide proper security and governance, democratize your data access, and allow advanced analytics tools like AI and machine learning to put actual data insights in the hands of the decision makers. Benefits include the ability to optimize marketing initiatives, streamlining of supply chains, better management of shipping logistics, forecasting of sales trends, and design and delivery of a better product. Lastly, just a little plug for Redapt: we're a premier partner for business transformation, serving thousands of clients and migrating millions of users to the Cloud. Our capabilities span the depth and breadth of today's IT, from consulting to world-class support. No matter where you are in your data migration, we have the experience and deep expertise you need to meet your objectives and realize the best return on your investment. We'd love to get in touch with you.
Awesome. Thank you, Christof. I did get a couple of questions sent directly to me, and I think we have a little time to cover those. I'll just fire them off, and if it makes sense, answer as best you can. First one: when it comes to outsourcing, outsourcing modernization is a little bit scary. I can empathize with that, because we sell consulting and engineering services, but I'm also a consumer of consulting services. Obviously we help with the expertise and the implementation, but how does Redapt help with knowledge transfer?
Ah, that's great. One of the things we can offer is workshops, where we come on site and you grab the people who need to learn this. We can use a train-the-trainer sort of approach, where you bring the heads of the interested parties together and we walk through what's happening. For example, we've got a BigQuery optimization workshop that we can deliver. Organizations decide they're all in, they start using BigQuery, and then they start seeing large BigQuery bills. The first thing they come and ask Redapt is, "Geez, how do I improve my BigQuery billing costs?" We can come in and walk through the best practices of running a BigQuery request and how to reduce those costs.
We can also come in and give a workshop on the various tools and their uses. The other thing our support offers is that we're there for you as an organization. When I provide support to someone, I'd much rather teach a person how to fish than give them a fish, because I've got plenty on my plate. If I teach someone within their organization how to fish, they're not going to come back with that same question, and neither is anyone else within the organization.
Yes. Cool. Well, that's great. I know as an organization we strongly believe that high-performing companies develop this expertise, and we think it's our role to accelerate that process. Like you said, we've got enough projects to do that we want to enable our customers to be successful on their own. Here's another one: at a high level, how do you balance data democratization with governance?
That's the million-dollar question. First of all, you have to start with: are there constraints? Are there governance constraints? Is there information that's PII that we can't let people have access to, or, if we have that data, who has access to it? One of the benefits of having a solid data governance system in place is that we can say this CFO and this CTO can have access to that data, but no one else on their team can. There are also obfuscation layers. With Data Catalog, you can identify data as being email-type data, for example, and blank out the email address. That blanked-out email address might be presented to one subset of users, while another subset of users sees the full email address. That's the importance of having a solution like Data Catalog: you can identify the data and who should or should not have access to specific types of data. Then it's fairly simple to put constraints on the visibility of that data.
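One lightweight way to get that kind of masking, sketched here with hypothetical names, is a BigQuery view that redacts a column for everyone outside an allow list; Data Catalog policy tags can enforce the same idea declaratively.

```python
# Hedged sketch: a BigQuery view that masks the email column for anyone
# not on an allow list. All project, dataset, and user names are
# hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

view = bigquery.Table("my-gcp-project.reporting.customers_masked")
view.view_query = """
    SELECT
      customer_id,
      IF(SESSION_USER() IN ('cfo@example.com', 'cto@example.com'),
         email, 'REDACTED') AS email
    FROM `my-gcp-project.crm.customers`
"""

# Analysts query the view; only allow-listed users see real addresses.
client.create_table(view)
```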
Okay, there's one more here, and I think we've still got a little bit of time. We're considering moving from one Cloud platform to GCP. What's the level of effort in moving a data lake? You don't need to get too deep, because we don't have all the details, but at a high level, how would we approach that? Is that a small, medium, or large effort? Thirty days, six months? What kind of effort is that?
It's the classic: it depends. For example, if the data lake is primarily in MySQL someplace, it's fairly simple. The Cloud SQL interface has a migration tool; BadaBing BadaBoom, you just point and click and in it comes. You set replication up, and the day you decide you're going to make Cloud SQL your master, you turn it into the master and you're off and running. But I think one of the challenges we run into most of the time is this: it's fairly simple most of the time to do a lift and shift from a current environment to Google Cloud. What is not simple is to take that external solution and leverage the strengths of what Google Cloud offers. Pound for pound, it's not very difficult. Depending on the size of the project, it could be a six-week process or a three-to-six-month process to get everything over and running. But if you really want to leverage Google Cloud and the managed services it provides, we might end up looking at restructuring your entire system.
A good example of this is one of the projects I worked on: an orchestration transition to Google Cloud. Their orchestration was hand-built in Python, thousands and thousands of lines of code and multiple data streams. When we put it into Cloud Composer and actually stepped back and looked at the design pattern of what they were trying to accomplish, I was able to reduce those thousands of lines of code to fewer than 500, and we ended up with only six workflow streams. The efficiency of that was tremendous. It took a lot of effort, a lot of lift, and a lot of coordination to make sure we were getting out what was being put into their original solution, but in the end they had a much more robust solution.
Ready to learn more? Read more: The In-Depth Guide to Adopting and Migrating to the Cloud