Google Cloud Platform (GCP) - How to Design a Data Pipeline on GCP
So if you have not watched it, go check that out. But today we'll understand it in the context of Google Cloud Platform. So friends, on my left we have all the different types of disparate data sources which we have in any enterprise. We have real-time data sources as well as batch-type data sources. So what are real-time data sources? They are sources which generate data very, very quickly, in a matter of milliseconds.
And these events which are happening need to be streamed somewhere, and this data needs to be inserted in real time. This data is generally called unbounded data. Why unbounded? Because there is no boundary to this data.
How much data will be generated, and for how long will it be generated? There is no boundary to it. It is real time. It can continue to generate data every millisecond, every hour, every week, every day of the year. So there is no boundary to it.
So what can be an example of this kind of data? Suppose there is a front-end application which tracks user registrations on your platform. As soon as any new user comes in, signs up to your application, and creates a login, an event needs to be triggered. The events could be countless, because there could be millions of users coming to your website.
So this kind of data is generally considered real-time streaming data. Sensor data as well: suppose there is some IoT sensor device which is generating data every second; that kind of sensor data is also considered real time. This kind of data can be streamed using Apache Kafka, the Kafka message broker. I have created a detailed video on Kafka, so to learn the basics of it, you can check that video.
But to understand it in a general way, Kafka is a message brokering service wherein you have a topic, and on that topic there are producers and subscribers. The producer of an event will continue to send data, and that data will continue to reside in that particular Kafka message queue. Then, based on the subscribers who are subscribed to that particular topic, those subscribers can pull the messages as and when they want. Okay, so this is real time. Now, what is batch? I think you already know.
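To make the producer/topic/subscriber idea concrete, here is a minimal in-memory sketch of the pattern in Python. This is not the real Kafka client; the class and method names are purely illustrative of how messages stay in the topic while each subscriber pulls at its own pace:

```python
from collections import defaultdict

class Topic:
    """A toy message topic: producers append, subscribers pull at their own pace."""
    def __init__(self):
        self.log = []                    # messages stay in the topic, like Kafka's log
        self.offsets = defaultdict(int)  # each subscriber tracks its own read position

    def publish(self, message):
        self.log.append(message)

    def pull(self, subscriber):
        """Return the messages this subscriber has not seen yet."""
        start = self.offsets[subscriber]
        self.offsets[subscriber] = len(self.log)
        return self.log[start:]

topic = Topic()
topic.publish({"event": "user_signup", "user": "alice"})
topic.publish({"event": "user_signup", "user": "bob"})

print(topic.pull("analytics"))   # both events
topic.publish({"event": "user_signup", "user": "carol"})
print(topic.pull("analytics"))   # only the new event
```

Note that publishing and pulling are decoupled: a slow subscriber never blocks the producer, which is the core property that makes this pattern suit unbounded, real-time data.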
So it could be any traditional OLTP system residing on-prem, and in this case I have taken an Oracle system. It could be a sales or inventory database, or your marketing database: a very, very normal OLTP system. The data we fetch from these systems is bounded data, or batch data, because there is a specific time window and a specific limit to the data we take, and we take the data in chunks. That's why it is called batch data: we are taking it in batches.
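The chunked, bounded extraction just described can be sketched in a few lines. Here sqlite3 (from Python's standard library) stands in for the on-prem Oracle OLTP source, and the table and column names are purely illustrative:

```python
import sqlite3

# Stand-in for the on-prem OLTP source (table/column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)",
                 [(100.0 * i,) for i in range(1, 11)])

def extract_in_batches(connection, batch_size):
    """Pull bounded data in fixed-size chunks, as a batch ETL job would."""
    cursor = connection.execute("SELECT id, amount FROM sales ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

for batch in extract_in_batches(conn, batch_size=4):
    print(f"loaded batch of {len(batch)} rows")
```

Because the source table is finite, the loop is guaranteed to terminate; that finiteness is exactly what "bounded" means in contrast to the streaming case.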
Similarly, there are third-party data sources as well. How? Suppose this company has multiple vendors and suppliers who want to send data, maybe in CSV or Excel files. Those third-party files need to be FTP'd to an FTP server, where they can be stored and later picked up. Now we have discussed all three.
Now, which particular technology does this whole job of fetching the data and loading it? I think we have understood this before as well: it is extraction, transformation, and loading, which is called ETL. The ETL tools are plenty, but I have listed the three which I have worked on; there are other ETL tools as well.
So: Informatica, SAP Data Services, IBM DataStage. All these ETL products play the crucial role of fetching the data from the source systems, whether real-time or batch data, and then loading it into your enterprise data warehouse. Inside your enterprise data warehouse, you could have multiple layers.
In this case, the first layer would be a staging layer, where you store your completely raw data. Then there will be a transformation layer, where you apply the business logic. And eventually there will be a final loading layer, where you load exactly the data you want. In our case, this is being done on a Teradata database, which is playing the role of our enterprise data warehouse. Once this whole cycle is finished, what you eventually have is an enterprise data warehouse holding data from disparate systems, and then, based on your requirements as an end business user, a data scientist, or a data analyst, you will have different data marts coming out of it.
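The three-layer flow just described (staging, transformation, final load) can be sketched like this. The records, field names, and business rules below are all hypothetical; the point is only that raw data lands untouched, business logic runs in the middle, and only validated rows reach the final layer:

```python
# Staging layer: raw records land exactly as extracted (names are illustrative).
staging = [
    {"order_id": 1, "amount": "250.00", "region": " emea "},
    {"order_id": 2, "amount": "120.50", "region": "APAC"},
    {"order_id": 3, "amount": "bad",    "region": "emea"},
]

# Transformation layer: apply business logic (type casting, cleanup, validation).
def transform(rows):
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue                      # reject rows that fail validation
        clean.append({"order_id": row["order_id"],
                      "amount": amount,
                      "region": row["region"].strip().upper()})
    return clean

# Final loading layer: only the data you actually want reaches the warehouse.
warehouse = transform(staging)
print(warehouse)
```

Keeping the staging copy raw is deliberate: if a business rule changes, you can re-run the transformation over staging without going back to the source systems.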
So what is a data mart? A data mart is a very specific, business-focused entity which stores data for serving one particular need. The warehouse holds the whole enterprise's data, but you could have a specific sales data mart which caters to the sales-related dashboards or sales-related analysis the business wants to do. Similarly, you can have data sets from which your machine learning models are built, and you can have your operational reporting as well.
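As a tiny illustration of carving a data mart out of the warehouse, a mart is essentially a business-specific slice of the enterprise data. The rows and column names here are hypothetical:

```python
# The enterprise warehouse holds data from all subject areas (rows illustrative).
warehouse = [
    {"dept": "sales",     "metric": "revenue", "value": 500},
    {"dept": "marketing", "metric": "clicks",  "value": 1200},
    {"dept": "sales",     "metric": "units",   "value": 42},
]

# A sales data mart: only the slice that the sales dashboards need.
sales_mart = [row for row in warehouse if row["dept"] == "sales"]
print(sales_mart)
```

In practice a mart is usually a separate set of tables or views, often with its own aggregations, but the idea is the same: one business need, one purpose-built subset of the warehouse.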
So all those different kinds of use cases can be derived out of these data marts. This is a very general, very simple way of seeing how we can design an ETL platform and an EDW on-prem. Again, it is a very simple design; in the real world it can be very complex, but we are understanding the concepts here, so I will keep it very simple.
So now we know how the on-prem solution is designed. Now let's take the same solution and try to match products within Google Cloud which will best suit this requirement, and maybe provide some additional functionality which we do not have in our on-prem solution. So what could those products be, and what could our solution look like? Let's design our Google Cloud based solution for the same enterprise and the same setup. So friends, now let's understand how we can create the same solution on Google Cloud Platform.
As we discussed, on the left we have our real-time data sources and our batch data sources. Now, in order to stream the real-time data, we will use Google Cloud Pub/Sub, which is an asynchronous message brokering service. As I said, like Kafka, it will create a topic, messages will be sent by the producers, and the consumers will subscribe to that topic and receive those events. So this part will be handled by Google Cloud Pub/Sub. I will give the link to each of these products in the description below so you can go and explore every product in detail.
Now comes a very interesting part which is how the batch works.
For Oracle, it is very clear that we can directly extract the data. But for the third-party files, instead of using an FTP server, what we can do is use gsutil, which is a command-line utility for Google Cloud Storage. Google Cloud Storage is object-based storage where you can store objects of any format. It could be videos, it could be files.
It could be XML or JSON. Google Cloud Storage has massive capabilities, offering 99.999999999% (eleven nines) annual durability, and it's highly resilient. So Google Cloud Storage could be used to store these files in Cloud Storage buckets, and we can use gsutil as the command-line utility to upload them.
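The real upload would be a one-line `gsutil cp vendor_file.csv gs://your-bucket/` call against an actual bucket. As a runnable stand-in, here is a sketch of the same FTP-replacement pattern using a local directory in place of a Cloud Storage bucket; all the paths and file names are illustrative:

```python
import shutil
import tempfile
from pathlib import Path

# Local directory standing in for a gs://... bucket (real usage: gsutil cp).
workdir = Path(tempfile.mkdtemp())
bucket = workdir / "vendor-uploads-bucket"
bucket.mkdir()

# Third-party file the vendor would otherwise FTP to us.
vendor_file = workdir / "supplier_prices.csv"
vendor_file.write_text("sku,price\nA1,9.99\nB2,4.50\n")

def upload_to_bucket(local_path: Path, bucket_dir: Path) -> Path:
    """Copy a file into the bucket, like `gsutil cp <file> gs://<bucket>/`."""
    dest = bucket_dir / local_path.name
    shutil.copy(local_path, dest)
    return dest

uploaded = upload_to_bucket(vendor_file, bucket)
print(uploaded.read_text().splitlines()[0])   # header row: sku,price
```

Once the files are in a bucket, downstream batch jobs can pick them up from one well-known location, which is exactly the role the FTP server played on-prem.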
Again, I'm just giving you.