When it comes to public cloud services, Google trails both Amazon Web Services (AWS) and Microsoft Azure in market share. With more than a decade of experience in collecting, organizing, and processing vast amounts of data, Google knows that its best bet lies in providing fully managed cloud services that help organizations process Big Data. It recently announced General Availability (GA) for two key cloud services in its Big Data portfolio: Cloud Dataflow and Cloud Pub/Sub.
Cloud Dataflow is a fully managed cloud service that allows for real-time streaming and batch processing of Big Data. The focus of Cloud Dataflow is to allow the developer to define data processing pipelines via a unified API and not worry about the infrastructure and provisioning behind it.
Cloud Dataflow provides a unified programming model with which you can create a workflow to ingest, process, store, and then analyze your data. It integrates well with other Google Cloud Platform services such as Google Cloud Storage, BigQuery, and Cloud Pub/Sub, and Google has announced integrations with various service partners as well. One example is Salesforce Wave Analytics, where Cloud Dataflow was used to aggregate, transform, and enrich incoming data before feeding it into the Wave platform, where users could then analyze and visualize it. Other partners that have announced deep integration with Cloud Dataflow include Cloudera, Tamr, and more.
The Cloud Dataflow SDK, currently available in Java (with a Python API planned later), is used to model the key building blocks of your data processing Pipeline. A Pipeline is what the Dataflow service executes for you. Each Pipeline consists of a series of transformation steps, each of which takes a collection of objects as input and produces an output. The Pipeline is bounded by I/O data sources and sinks: input can come from sources such as Google Cloud Storage, from unbounded streaming data delivered by services like Cloud Pub/Sub, and more. The output can be fed into analytics services, not just from Google but also from other vendors that have integrated with the open SDK. Check out the Programming Model for more details, including samples to get started.
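To make the source, transform, and sink structure concrete, here is a minimal sketch in plain Python. It does not use the Cloud Dataflow SDK: `ToyPipeline`, `split_words`, and `count_words` are hypothetical names invented for illustration, and the "pipeline" runs in memory on a small list rather than on a managed service.

```python
from typing import Callable, Iterable, List


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: one source, ordered transforms, one sink."""

    def __init__(self, source: Iterable) -> None:
        self._source = source
        self._transforms: List[Callable[[Iterable], Iterable]] = []

    def apply(self, transform: Callable[[Iterable], Iterable]) -> "ToyPipeline":
        # Register a transformation step; returning self allows chaining.
        self._transforms.append(transform)
        return self

    def run(self, sink: Callable[[list], None]) -> None:
        # Feed the source through each step in order, then hand the result to the sink.
        data = self._source
        for transform in self._transforms:
            data = transform(data)
        sink(list(data))


def split_words(lines: Iterable[str]) -> Iterable[str]:
    # Transformation step 1: a collection of lines becomes a collection of words.
    for line in lines:
        yield from line.lower().split()


def count_words(words: Iterable[str]) -> Iterable[tuple]:
    # Transformation step 2: a collection of words becomes sorted (word, count) pairs.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items())


results: list = []
(
    ToyPipeline(["to be or not to be", "to see or not to see"])  # in-memory source
    .apply(split_words)
    .apply(count_words)
    .run(results.extend)  # in-memory sink
)
print(results)
```

In a real Dataflow job the source and sink would be services such as Google Cloud Storage or Cloud Pub/Sub, and the service, not your own process, would execute the steps.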
If you look at the Cloud Dataflow model above, it is clear that it needs to be fed by a service capable of handling millions of messages moving in and out of the system asynchronously. This is where pairing Cloud Pub/Sub with Cloud Dataflow becomes powerful. Cloud Pub/Sub, as mentioned earlier, is now in General Availability too. It again provides a single unified API for large-scale messaging and handles concerns such as scalability, logging, and availability behind the scenes. The pricing is attractive too, at 5 cents per million messages.
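As a rough illustration of what that rate means in practice, the sketch below estimates a bill from a message count. It assumes a flat 5 cents per million messages exactly as quoted in this article; actual pricing may include tiers or other charges, and `estimate_cost` is a hypothetical helper, not part of any Google library.

```python
from decimal import Decimal

# Assumption: flat rate taken from the figure quoted in this article.
PRICE_PER_MILLION_USD = Decimal("0.05")


def estimate_cost(messages: int) -> Decimal:
    """Return the estimated cost in US dollars for the given number of messages."""
    return PRICE_PER_MILLION_USD * Decimal(messages) / Decimal(1_000_000)


cost = estimate_cost(250_000_000)  # 250 million messages
print(f"${cost:.2f}")  # prints "$12.50"
```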
The Cloud Pub/Sub API is now available in v1, and full documentation is available here. The API methods are broadly divided into two categories: topics and subscriptions. Client libraries that wrap the API are available in multiple languages, including Go, Python, Java, C#, PHP, and more.
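The split between the two categories is easier to see in a toy model. The sketch below is an in-memory illustration only: `ToyPubSub` and its methods are hypothetical names, not the Cloud Pub/Sub client library, and it ignores delivery retries, ordering, and persistence. It shows publishers writing to a topic while each consumer reads from, and acknowledges on, its own subscription.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

Message = Tuple[int, str]  # (message id, payload)


class ToyPubSub:
    """Hypothetical in-memory model of topics and subscriptions."""

    def __init__(self) -> None:
        self._subs_by_topic: Dict[str, List[str]] = {}
        self._queues: Dict[str, Deque[Message]] = {}
        self._next_id = 0

    def create_topic(self, topic: str) -> None:
        self._subs_by_topic.setdefault(topic, [])

    def create_subscription(self, topic: str, subscription: str) -> None:
        # A subscription attaches to one topic and keeps its own backlog.
        self._subs_by_topic[topic].append(subscription)
        self._queues[subscription] = deque()

    def publish(self, topic: str, data: str) -> int:
        # Every subscription on the topic receives its own copy of the message.
        self._next_id += 1
        for subscription in self._subs_by_topic[topic]:
            self._queues[subscription].append((self._next_id, data))
        return self._next_id

    def pull(self, subscription: str, max_messages: int = 10) -> List[Message]:
        # Pulling does not remove messages; they stay until acknowledged.
        return list(self._queues[subscription])[:max_messages]

    def acknowledge(self, subscription: str, message_ids: Iterable[int]) -> None:
        acked = set(message_ids)
        backlog = self._queues[subscription]
        self._queues[subscription] = deque(m for m in backlog if m[0] not in acked)


bus = ToyPubSub()
bus.create_topic("clicks")
bus.create_subscription("clicks", "dataflow-job")
bus.create_subscription("clicks", "audit-log")

first = bus.publish("clicks", "user=1 page=/home")
bus.publish("clicks", "user=2 page=/pricing")

bus.acknowledge("dataflow-job", [first])  # only this subscription's backlog shrinks
remaining = bus.pull("dataflow-job")
untouched = bus.pull("audit-log")
print(remaining)
print(untouched)
```

Because each subscription tracks its own backlog, one consumer acknowledging a message does not affect what another consumer on the same topic still has to process.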
The ability for organizations to process data in a way that is easy for them, and open enough to let them use their own I/O sources and targets (applications), is going to be key moving forward. Cloud Dataflow and Cloud Pub/Sub provide a compelling option for organizations that want to focus on large-scale data processing rather than complicated setup and infrastructure management.