Speed and efficiency are always important in data management, and especially so at Uber's scale. Uber processes data for millions of trips across more than 400 cities in 60 countries every day. Needless to say, that's a lot of data, and finding a way to optimize their JSON payloads was an important goal for the Uber engineering team. Luckily for us, Kare Kjelstrom, an engineering manager at Uber's office in Aarhus, Denmark, has shared the process through which Uber was able to streamline their data management protocol.
In the beginning, Uber managed their trip data via JSON blobs, with each trip requiring about 20 kilobytes (KB) of data. Assuming Uber processed raw JSON data for exactly 1 million trips a day, a 32-terabyte (TB) disk would last less than three years (provided ~40% of storage is reserved for system components). Clearly, such a model was unsustainable, so the engineering team began to explore alternatives. Kjelstrom explains that the initial intent was to "squeeze the data without sacrificing performance. Reclaiming several KB per trip across hundreds of millions of trips during the year would save us lots of space and give us room to grow." To achieve this, Kjelstrom planned to combine a binary encoding protocol with a compression algorithm. That raised the question: which protocol and which algorithm?
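The back-of-the-envelope arithmetic behind that estimate is easy to check. A short sketch, using only the figures stated above (1 million trips/day at ~20 KB each, a 32 TB disk, ~40% reserved):

```python
# Storage-lifetime estimate from the figures quoted above.
TRIP_SIZE_KB = 20
TRIPS_PER_DAY = 1_000_000
DISK_TB = 32
RESERVED_FRACTION = 0.40  # portion of the disk held back for system components

daily_gb = TRIP_SIZE_KB * TRIPS_PER_DAY / 1_000_000    # KB/day -> GB/day
usable_gb = DISK_TB * 1_000 * (1 - RESERVED_FRACTION)  # TB -> usable GB
days = usable_gb / daily_gb

print(f"{daily_gb:.0f} GB of raw JSON per day; "
      f"disk lasts ~{days:.0f} days (~{days / 365:.1f} years)")
```

At 20 GB of raw JSON per day, the usable portion of the disk is exhausted in well under three years, and that is before accounting for growth in trip volume.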
To answer this, the engineering team set out to benchmark potential protocols and algorithms to find the winning combination. Uber selected 10 encoding protocols (Thrift, Protocol Buffers, Avro, JSON, UJSON, CBOR, BSON, MessagePack, Marshal, Pickle) and tested each one both uncompressed and paired with each of 3 lossless compression algorithms (Snappy, zlib, Bzip2), for a total of 40 candidate combinations. The encoding protocols included both IDL-based formats (which require schema definitions) and schema-free ones, each presenting respective strengths and weaknesses in terms of flexibility, validation, and formatting. To test the candidates, Uber wrote scripts to process data from roughly 2,000 trips from Uber New York City, focusing on the speed of "encoding/decoding, compressing/inflating, and the gain or loss in size" of the data. Here's what they found:
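The shape of such a benchmark is straightforward to sketch. The following is not Uber's actual script: it uses a made-up trip record and only the standard-library subset of the candidates (JSON and Pickle encoders; zlib and Bzip2 compressors), but it shows the encode-then-compress loop and the two measurements — payload size and elapsed time — that the team compared:

```python
import bz2
import json
import pickle
import time
import zlib

# Hypothetical trip record standing in for Uber's real trip data.
trip = {"id": "abc123", "city": "New York", "fare": 23.5,
        "route": [[40.7128, -74.0060], [40.7306, -73.9352]]}

encoders = {"json": lambda obj: json.dumps(obj).encode(),
            "pickle": pickle.dumps}
compressors = {"none": lambda b: b,
               "zlib": zlib.compress,
               "bzip2": bz2.compress}

for enc_name, encode in encoders.items():
    for comp_name, compress in compressors.items():
        start = time.perf_counter()
        payload = compress(encode(trip))
        elapsed = time.perf_counter() - start
        print(f"{enc_name}+{comp_name}: {len(payload)} bytes, "
              f"{elapsed * 1e6:.0f} us")
```

A real harness would average over thousands of trips and also time the decode/inflate direction, since read speed matters as much as write speed.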
This graph plots the candidates with respect to size and speed, with the lower left-hand corner being the optimal position (i.e., the fastest processing speed with the smallest data footprint). The "Pareto efficiency" line is shown in red, representing the candidates that strike the best balance between speed and size. Based on these results, the team whittled down their list and reexamined the top contenders. For pragmatic reasons, IDL-based encoding protocols were eliminated to avoid defining schemas, resulting in the following list:
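Pareto efficiency has a precise meaning here: a candidate sits on the frontier if no other candidate is at least as good on both axes and strictly better on one. A minimal sketch of that selection, with illustrative numbers rather than Uber's measurements:

```python
# Illustrative (size in KB, time in microseconds) results -- NOT Uber's data.
results = {
    "json+zlib":    (9.0, 210.0),
    "msgpack+zlib": (7.5, 180.0),
    "cbor+snappy":  (8.0, 120.0),
    "bson+bzip2":   (10.0, 400.0),
}

def pareto_front(points):
    """Keep each candidate unless some other candidate dominates it:
    at least as small AND at least as fast, and better on one axis."""
    front = []
    for name, (size, t) in points.items():
        dominated = any(
            os <= size and ot <= t and (os, ot) != (size, t)
            for other, (os, ot) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(results))
```

With these sample numbers, the slow-and-large `bson+bzip2` and the dominated `json+zlib` fall off, while the two frontier candidates each win on a different axis — exactly the trade-off the red line in the graph visualizes.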
Marshal was eventually cast aside because it is Python-only, and JSON with zlib because its output is clearly larger than the rest. The team then excluded UJSON, despite its speed, due to its slightly larger size, leaving CBOR and MessagePack, of which they settled on the latter.
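In practice the winning encoder still gets paired with a compressor before storage. A minimal round-trip sketch using the third-party `msgpack` package (`pip install msgpack`), shown here with zlib as the compression stage (the record and field names are made up for illustration):

```python
import zlib

import msgpack  # third-party: pip install msgpack

# Hypothetical trip record for illustration.
trip = {"id": "abc123", "city": "New York", "fare": 23.5}

# Write path: encode to MessagePack, then deflate before storage.
stored = zlib.compress(msgpack.packb(trip))

# Read path: inflate, then decode back to the original structure.
restored = msgpack.unpackb(zlib.decompress(stored))
assert restored == trip
```

The binary MessagePack encoding already drops JSON's quoting and punctuation overhead, and zlib then squeezes out the remaining redundancy, reclaiming the "several KB per trip" Kjelstrom was after.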
In conclusion, optimizing the speed and size of your data storage is inherently good. As David Berlind discusses in his article on optimizing APIs, failing to optimize your data management can cost money, time, and even sales. An extra few bytes here and there might not seem significant, but at scale proper storage protocols can have immense impacts on business and the bottom line.