API Updates Shine in Apache Spark 2.0 Release

Of the many new features that have arrived with the 2.0 release of Apache Spark, the API updates may be the headliner. Yes, Apache flattened the Lambda architecture and improved performance, but the new Structured Streaming API and the consolidation of the DataFrame and Dataset APIs have drawn the most attention. With this first release in the 2.x line, many have declared that the compute engine has matured greatly in usability, simplicity, and functionality.

The Structured Streaming API is a higher-level API for building continuous applications: end-to-end streaming applications that integrate with storage, serving systems, and batch jobs in a fault-tolerant manner. The API is built on top of Spark SQL and the Catalyst optimizer. For now, the Structured Streaming API is still marked experimental.
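One way to picture a continuous application is as the same query logic applied incrementally as new data arrives, so that the running result always matches what a batch job over all data seen so far would produce. The sketch below illustrates that idea in plain Python. It is not Spark code and does not use the Structured Streaming API; the names batch_word_count and IncrementalWordCount are hypothetical and exist only for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def batch_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Batch version: count words over a complete, bounded input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


class IncrementalWordCount:
    """Streaming version (hypothetical): the same query, with state
    updated each time a new micro-batch of lines arrives."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def process(self, micro_batch: Iterable[str]) -> Dict[str, int]:
        # Fold the new lines into the running state and return a snapshot.
        for line in micro_batch:
            self._counts.update(line.split())
        return dict(self._counts)


# Simulated stream: three micro-batches arriving one after another.
micro_batches = [["spark streaming"], ["spark sql", "catalyst"], ["spark"]]

query = IncrementalWordCount()
latest: Dict[str, int] = {}
for batch in micro_batches:
    latest = query.process(batch)

# The incremental result agrees with a batch job over all data seen so far.
all_lines = [line for batch in micro_batches for line in batch]
assert latest == batch_word_count(all_lines)
```

In a real engine the state would also need to be checkpointed to survive failures, which is the fault-tolerance aspect the release emphasizes; this sketch leaves that out.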

Additionally, Apache has consolidated the DataFrame and Dataset APIs. DataFrame is now simply a type alias for Dataset of Row. The consolidated API brings untyped (DataFrame) and typed (Dataset) operations together in a single interface, which reduces clutter and simplifies the environment.
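To make the "type alias" idea concrete, here is a minimal sketch in plain Python. It only models the relationship between the two names; it is not the Spark API, and the Row, Dataset, and Person classes below are simplified stand-ins written for this example.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Row:
    """Stand-in for a generic, untyped record (hypothetical)."""
    values: Tuple[object, ...]


class Dataset(Generic[T]):
    """Stand-in for a typed collection with a few transformations."""

    def __init__(self, items: List[T]) -> None:
        self._items = list(items)

    def filter(self, predicate: Callable[[T], bool]) -> "Dataset[T]":
        return Dataset([x for x in self._items if predicate(x)])

    def map(self, fn: Callable[[T], U]) -> "Dataset[U]":
        return Dataset([fn(x) for x in self._items])

    def collect(self) -> List[T]:
        return list(self._items)


# The consolidation in one line: DataFrame is just Dataset specialized to Row.
DataFrame = Dataset[Row]


@dataclass(frozen=True)
class Person:
    name: str
    age: int


# Typed use: elements are Person objects with named fields.
people = Dataset([Person("Ada", 36), Person("Linus", 17)])
adults = people.filter(lambda p: p.age >= 18).map(lambda p: p.name).collect()

# Untyped use: the same class and the same methods, but elements are Rows.
frame = DataFrame([Row(("Ada", 36)), Row(("Linus", 17))])
adult_rows = frame.filter(lambda r: r.values[1] >= 18).collect()
```

Because both names refer to one class, every transformation is defined once and works for typed and untyped data alike, which is the kind of clutter reduction the consolidation aims for.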

At a high level, Apache highlights the major updates as API usability, SQL 2003 support, performance improvements, structured streaming, R UDF support, and other operational improvements. Further, the new release includes over 2,500 patches. In keeping with its open-source foundations, the patches came from over 300 contributors. For more details, check out the new release docs.
