Yahoo's Open Sourced S4 Could be a Real-time Cloud Platform

In a world where real-time data streams are becoming much more common, and with the volume of that data continuing to increase, it makes sense that a framework would be developed to make that data easier to process. Yahoo! S4 isn't the first such framework to be conceived, or even open sourced, but it is likely to massively increase awareness that such frameworks exist and of the problems they may help solve. It should also get developers thinking about how they could use the technology, and may increase the likelihood of somebody moving S4-like capabilities into the cloud and offering them as a service.

The requirement for a "distributed stream computing Platform" came about because Yahoo! needed to process thousands of search queries per second, from potentially millions of users per day, to facilitate the generation of highly personalized adverts for web search. A new framework was required because Yahoo! felt that MapReduce, which is commonly used to process large datasets in batch jobs, was "hard to apply to stream computational tasks".

Yahoo! describe the S4 framework using a number of terms that have become commonplace in the world of cloud computing:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Exactly what Yahoo! S4 is, and what it is capable of, has been discussed in a number of other places. The most commonly used term by comparable frameworks is Complex Event Processing with applications including filtering, correlation and pattern matching. These discussions will no doubt continue but ultimately a framework is something that can be put to multiple uses which is why Yahoo! chose to call it "general-purpose".

Yahoo! have created a couple of examples to demonstrate some of the basic capabilities and clarify what S4 can do. One of the examples receives data from the Twitter real-time Garden Hose stream, counts the number of times a hashtag is mentioned and keeps an ordered list of the most commonly mentioned hashtags. Each step of the process is performed in what Yahoo! are calling Processing Elements. It's these elements that enforce the separation of each logical step of the process (e.g. receive update, extract hashtags, count hashtags, order hashtag count list) and allow the execution of the process to take place on a distributed system.
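To make the separation of steps more concrete, here is a minimal sketch of the hashtag example in plain Python. It does not use the S4 API (S4 itself is a Java framework), and every name in it (extract_hashtags, HashtagCounter, TopHashtags, run_pipeline) is hypothetical and chosen for illustration only. Each small class stands in for one Processing Element, and everything runs in a single process rather than on a distributed system.

```python
import re
from collections import Counter

# Hypothetical stand-ins for Processing Elements; this is NOT the S4 API.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(status_text):
    """Extract step: return the hashtags in one status update, lowercased."""
    return [tag.lower() for tag in HASHTAG_RE.findall(status_text)]


class HashtagCounter:
    """Count step: keep a running count for every hashtag seen so far."""

    def __init__(self):
        self.counts = Counter()

    def process(self, hashtag):
        self.counts[hashtag] += 1
        # Emit the updated count for the next step to consume.
        return hashtag, self.counts[hashtag]


class TopHashtags:
    """Order step: remember the latest count per hashtag and rank them."""

    def __init__(self, size):
        self.size = size
        self.latest = {}

    def process(self, hashtag, count):
        self.latest[hashtag] = count

    def ranking(self):
        # Highest count first; ties are broken alphabetically.
        ordered = sorted(self.latest.items(), key=lambda item: (-item[1], item[0]))
        return ordered[: self.size]


def run_pipeline(updates, size=3):
    """Feed each received update through extract, count and order in turn."""
    counter = HashtagCounter()
    top = TopHashtags(size)
    for text in updates:  # receive step: one status update at a time
        for tag in extract_hashtags(text):
            top.process(*counter.process(tag))
    return top.ranking()


if __name__ == "__main__":
    sample = ["Trying out #S4 and #streams", "More #s4 news", "#cloud and #S4"]
    print(run_pipeline(sample, size=2))
```

Because each step only consumes the output of the previous one, the steps could in principle be placed on different machines, which is the property the article attributes to Processing Elements. How S4 actually routes events between elements is not covered by this sketch.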

One thing potentially holding back S4 adoption is that it is not yet offered as a service. As well as writing their own Processing Elements, developers will have to host their own distributed stream computing platform. If S4 proves to be a useful and popular framework then we may start to see hosted distributed stream computing platform services, in the same way that we've already seen MapReduce being offered as a service by Amazon.

Yahoo! S4 is yet another powerful real-time component now available to the Programmable Web. It opens up a number of possibilities for developers to start building exciting data-centric applications, mashups or hosted services which could integrate with other components such as real-time APIs, real-time client push services and DaaS offerings.
