Google Cloud Dataflow Eases Large-Scale Data Processing

Google Cloud Platform is making a big push toward Big Data services. Google Cloud Dataflow has entered beta, providing a powerful big data processing platform in the cloud. The key benefit Google is promoting is that Cloud Dataflow lets implementers focus on the programming and data analysis problem at hand, rather than on the infrastructure and resources that must be provisioned and tuned for optimum performance.

Google Cloud Dataflow is now available as a managed service on Google Cloud Platform. It brings together Google’s vast experience in building large-scale data processing platforms with MapReduce and other tools. The service allows users to define both batch and stream processing pipelines via a unified programming model.

Users can define their data processing jobs via the Cloud Dataflow SDKs. The SDKs simplify defining a data pipeline with the Dataflow programming model, which comprises fundamental building blocks for data representation, data transformations, and reading from and writing to a variety of formats and storage technologies. Google has open sourced the Cloud Dataflow SDK for Java, which allows programmers to incorporate the Cloud Dataflow programming model into their applications. The data processing workload is then handled by the Cloud Dataflow managed service, which runs the pipelines across various services on Cloud Platform such as Compute Engine, Cloud Storage and BigQuery.
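To make the read, transform, write shape of such a pipeline concrete, here is a minimal sketch in plain Java. It uses only the standard library and runs in memory on a single machine. The class and method names (PipelineSketch, splitWords, countWords, format, run) are hypothetical and are not taken from the Cloud Dataflow SDK; the sketch only illustrates the idea of chaining transformations over a collection and does not show how the managed service distributes the work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, stdlib-only illustration of a pipeline of transforms.
// None of these names come from the Cloud Dataflow SDK.
class PipelineSketch {

    // Transform 1: split each input line into lowercase words.
    static List<String> splitWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    // Transform 2: count occurrences of each word.
    // A TreeMap keeps the keys in sorted order so output is deterministic.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Transform 3: format each count as a "word: n" output record.
    static List<String> format(Map<String, Integer> counts) {
        List<String> records = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            records.add(entry.getKey() + ": " + entry.getValue());
        }
        return records;
    }

    // The "pipeline": the three transforms applied in sequence.
    static List<String> run(List<String> input) {
        return format(countWords(splitWords(input)));
    }

    public static void main(String[] args) {
        // In-memory stand-in for a source such as a file or a table.
        List<String> input = Arrays.asList("Batch and stream", "stream and batch");
        // Printing stands in for writing to a sink.
        for (String record : run(input)) {
            System.out.println(record);
        }
    }
}
```

Running main prints one line per distinct word with its count. In a managed pipeline service, the equivalent steps would be declared through the SDK and executed by the service rather than called directly like this.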

For those invested in the Google Cloud Platform, the Dataflow service integrates well with other services in the ecosystem, including Google Cloud Storage, Compute Engine and BigQuery. Check out the diagram below:

Billing for Google Cloud Dataflow will begin April 27. Users should consider not just the service's own pricing but also the cost of the other Google Cloud Platform services used in running the data pipeline, such as Cloud Storage, Compute Engine, Networking and BigQuery.

As data analytics gains momentum, managed services let data scientists focus on the data analysis job rather than on the infrastructure and scalable architecture work that has often been part of such projects. Google Cloud Dataflow presents a compelling option for building large-scale data processing pipelines on Google’s best-in-class infrastructure.
