How to Make Sense of Big Data through Data Visualization

In this age of Big Data, companies typically have no problem with amassing information; the issue is making sense of that data. Data visualization enables IT and business managers to get clear and varied views of data, allowing them to more effectively make strategic business decisions. Several big-name vendors, including Amazon and Google, offer big, expensive off-the-shelf solutions. However, there are also a number of HTML5- and SVG-based solutions emerging that make the data visualization process easier and clearer—and much less expensive.

In this report, we will focus on open-source software that can help organizations build their own solution for data visualization.

The Big Deal about Big Data

Big data comes from a variety of sources, ranging from social interactions to thermostats. As the amount of data produced and made available on the web grows larger and larger, structuring and visualizing it in meaningful (and human-friendly) ways becomes increasingly important. Indeed, companies are in something of an arms race to make sense of this flood of information because it can easily make the difference between understanding the market and going out of business.

While most companies easily cope with collecting data (from clicks, visits, purchases, application logs, and so on), not all can transform this information into knowledge (or, at least, valuable indications). The key point is that gathering terabytes of information is worthless if organizations don't have an efficient way to summarize and visualize it, extracting meaning from mere information.

One of the issues is that skilled data scientists and analysts are in short supply. In fact, data analysis is something that companies should consider outsourcing to a provider that can demonstrate a history of success with big data analysis. IT giants have already entered the arena: Google (Analytics), IBM (Watson) and Amazon, just to mention a few, are offering free and enterprise versions of their tools. A number of startups have also successfully started businesses in this area, including Glow, SwiftIQ and Brand Networks (formerly known as Optimal).

There are, of course, also off-the-shelf solutions for easily building complex and interactive charts. A few of them are very good: Highcharts, for example, or comprehensive analytics tools like Pentaho.

Where these tools usually fall short is customization: if you need more flexibility, you'll probably want the kind of personalized approach that you can build yourself, or have built for you by one of the startups mentioned above.

DIY and How: HTML5- vs. SVG-based Charts

Among the many libraries available for data visualization, the first big distinction is between HTML5 (Canvas) libraries and SVG libraries.

HTML5 solutions use the Canvas object, introduced in HTML5, to draw the shapes (or images) in the chart.

SVG solutions, by contrast, use the Scalable Vector Graphics format, an XML-based markup language that is supported by all modern browsers. The most widely implemented version is SVG 1.1, and version 2 of the SVG specification is in the works at the W3C.

There are pros and cons to both SVG and HTML5 solutions. It may be oversimplifying things, but we'll sum things up in a few points:

  • Since SVG is a vector-based format, each drawn shape is an actual object added to the DOM. This means that the same shape or complex drawing can be ported to different devices and screen sizes without any noticeable loss of quality. In addition, because each shape is an object of its own, if an attribute of an SVG object is changed, the browser can automatically update the view by re-rendering the corresponding node.
  • Canvas is raster-based, so resizing a created image will produce pixelation or loss of detail. In addition, to reposition a shape or change its color, the whole scene needs to be redrawn. (However, layers can be created, such that only the layer containing a particular object needs to be redrawn.)
  • It is possible to add an event listener to an SVG element, but not to a Canvas shape. The behavior of mouse clicks, for example, must be associated with pixel coordinates on the canvas rather than with shapes.
  • On the other hand, when lots of elements are drawn to an SVG chart, each one is added to the DOM, even if it is not visible. In charts with thousands of elements to be shown, SVG solutions don't scale well.
  • Canvases are great for drawing raster images and creating sprites, which is not what SVG is designed for.
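
The point above about event listeners can be made concrete with a small sketch. Because a canvas only knows about pixels, the application has to remember where each shape was drawn and work out which one a click landed on. The shape list and the `hitTest` function below are hypothetical names chosen for illustration, and the example assumes rectangular shapes only.

```javascript
// Hypothetical shape list: each rectangle records where it was drawn.
const shapes = [
  { id: 'bar-a', x: 10, y: 40, width: 30, height: 60 },
  { id: 'bar-b', x: 50, y: 20, width: 30, height: 80 },
];

// Return the topmost shape containing the point, or null.
// Shapes later in the array are assumed to be drawn on top.
function hitTest(shapeList, px, py) {
  for (let i = shapeList.length - 1; i >= 0; i--) {
    const s = shapeList[i];
    const inside = px >= s.x && px <= s.x + s.width &&
                   py >= s.y && py <= s.y + s.height;
    if (inside) return s;
  }
  return null;
}

// In a browser this would be wired to the canvas element, for example:
// canvas.addEventListener('click', (e) => {
//   const r = canvas.getBoundingClientRect();
//   const hit = hitTest(shapes, e.clientX - r.left, e.clientY - r.top);
// });
```

With SVG, none of this bookkeeping is needed: the listener is attached directly to the element that represents the bar.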

All things considered, since you will rarely need to insert more than a few hundred shapes in a single chart (and if you do, you might want to revisit your design), SVG has so far been the most widely adopted technology for data visualization. If you need Canvas solutions, however, Chart.js is a nice library for charts, while Fabric.js is a lower-level API that allows shape manipulation in canvases. In addition, the brand-new p5.js is an amazing HTML5 library aimed specifically at artists.

SVG Libraries
To test the merits of this category of products, we closely evaluated three SVG systems: D3, Raphael and Snap.svg. There are differences among the products, even in each of their respective categories, but our focus here is on the bigger picture.
There are, of course, dozens of SVG libraries already available, and new, more advanced ones are regularly created. (For example, Snap.svg itself has been recently launched.) However, the three systems we will evaluate closely in this report are the most widely used and, likely, the most complete solutions currently available.

D3.js is a data-driven, SVG-oriented visualization library created in 2011 at Stanford, and kept as an open-source project. As of today, it is probably the most widely adopted library for data visualization and the one with the highest number of plug-ins and extensions. D3 is also the first library to offer built-in data-binding, via JavaScript.

The main pros D3.js offers are:

  • Extensions or examples for almost any kind of chart you might want to create
  • Native support for data binding
  • Very good and thorough documentation

But there are also a few cons to take into consideration:

  • Data binding in JavaScript is a little complex for complicated graphs
  • Performance is sometimes an issue, and some popular extensions are prone to memory leaks

Raphael, which was created by Dmitry Baranovskiy as a personal project, provides a set of primitives for vector-based graphics in browsers supporting SVG and VML. It doesn't provide data-binding capabilities, and it is not specifically chart-oriented. Raphael offers a good API, both for performance and clarity, and superior control over animations. It is well suited to any kind of SVG representation, but it doesn't have primitives for chart creation.

gRaphael is a charting library built upon Raphael that partly fills this gap. However, neither the number of charts in its showcase nor its set of plugins and extensions is anywhere near the rich set of ready-to-use solutions provided by D3.

Snap.svg is the evolution of Raphael. It's designed for modern browsers (and won't work with obsolete browsers), and it supports the newest SVG features, including masking, clipping, patterns, full gradients and groups.

Snap.svg is probably the best solution at the moment for SVG-based drawing, and for animations or complex visualizations involving advanced effects. However, like Raphael, it is a low-level library. Although it is possible to create complete charts using Snap, doing so takes considerably more work than customizing a chart from the D3 showcase (assuming there is one close to your needs). Moreover, the project is still very young, so using it in production might be a risk at this time.

Declarative Frameworks
As we have noted, D3.js supports data binding, so that your charts can be updated as the data change. Updates, however, must be triggered manually by developers whenever they think the data might have changed, and the update mechanism must be spelled out explicitly in the code by providing callbacks to be invoked for data points entering and exiting the dataset associated with the graph.
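
To picture what "entering" and "exiting" data points mean, the bookkeeping can be sketched as a keyed diff between the old and the new dataset. This is a library-free illustration of the idea, not D3's implementation; `diffByKey` and the sample data are hypothetical.

```javascript
// Split a new dataset against the previous one, matching items by key.
function diffByKey(previous, next, keyOf) {
  const prevKeys = new Set(previous.map(keyOf));
  const nextKeys = new Set(next.map(keyOf));
  return {
    enter: next.filter((d) => !prevKeys.has(keyOf(d))),   // new points
    update: next.filter((d) => prevKeys.has(keyOf(d))),   // points kept
    exit: previous.filter((d) => !nextKeys.has(keyOf(d))), // points removed
  };
}

const before = [{ name: 'Morning' }, { name: 'Afternoon' }];
const after = [{ name: 'Afternoon' }, { name: 'Evening' }];
const changes = diffByKey(before, after, (d) => d.name);
```

In an imperative setup, the developer decides when to run such a diff and supplies a callback for each of the three groups (create elements for `enter`, restyle for `update`, remove for `exit`).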

Declarative frameworks also provide primitives for data-binding, and they, too, are not limited to SVG: Any DOM element can be associated with data and kept in sync with it. The difference is that the binding mechanism in declarative frameworks leverages templates and happens in the presentation (the HTML markup); moreover, once data is bound to DOM nodes, the presentation is updated automatically every time the data behind it changes.

These frameworks have an added value that can be summarized in a few key points:

  • They enforce a greater degree of separation between content and presentation, by allowing the binding of event handlers and even composite layout (like tabular or tree-shaped structures) directly in the presentation layer;
  • They provide an easy way to keep your data model and your presentation in sync;
  • They generally do it in an extremely efficient way, making sure to reflow only the minimum possible subset of your DOM. Since reflowing and repainting are usually bottlenecks for client-side browser applications, the performance gain can be significant.

Without going into too much detail, it's worth mentioning that these frameworks keep data and the DOM in sync using either dirty checking (Angular) or container objects (Ractive, React, Ember). While container objects make integration with external code more difficult, dirty checking is likely to become computationally intensive, if not impractical, as the number of watched objects grows.
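
The trade-off between the two strategies can be illustrated with a deliberately simplified sketch. Neither function below reproduces the internals of Angular or Ractive; `createDirtyChecker` and `createContainer` are hypothetical names, and a real digest loop would typically repeat until no further changes are detected.

```javascript
// Dirty checking: remember the last value of every watched expression and
// compare on each digest pass. Cost grows with the number of watchers.
function createDirtyChecker() {
  const watchers = [];
  return {
    watch(getter, onChange) {
      watchers.push({ getter, onChange, last: getter() });
    },
    digest() {
      let checked = 0;
      for (const w of watchers) {
        checked++;
        const current = w.getter();
        if (current !== w.last) {
          w.onChange(current, w.last);
          w.last = current;
        }
      }
      return checked; // number of comparisons performed
    },
  };
}

// Container object: data lives behind get/set, so the container knows
// exactly which key changed and notifies only that key's observers.
function createContainer(initial) {
  const data = Object.assign({}, initial);
  const observers = {};
  return {
    get(key) { return data[key]; },
    set(key, value) {
      if (data[key] === value) return;
      data[key] = value;
      (observers[key] || []).forEach((fn) => fn(value));
    },
    observe(key, fn) {
      (observers[key] = observers[key] || []).push(fn);
    },
  };
}

const log = [];
const model = { clicks: 1, visits: 2 };
const checker = createDirtyChecker();
checker.watch(() => model.clicks, (v) => log.push('dirty:' + v));
checker.watch(() => model.visits, (v) => log.push('dirty:' + v));
model.clicks = 5;                      // plain assignment, noticed at digest
const comparisons = checker.digest();  // every watcher is compared

const box = createContainer({ clicks: 1 });
box.observe('clicks', (v) => log.push('container:' + v));
box.set('clicks', 7);                  // changes must go through set()
```

The plain object works with any external code but every watcher is re-evaluated on each pass; the container does no scanning, but outside code has to use its `get` and `set` methods.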

Although it might at first sound counterintuitive, declarative frameworks allow for a higher level of separation between logic and presentation. It is true that we move part of the logic to the markup, but we control the degree of this contamination so that we can limit it to a minimum: just the declaration of the binding. The benefits we get in exchange are far more interesting:

  • MVC-compliant code;
  • We completely avoid any presentation trespassing into logic (we’ll see that’s often not the case with D3);
  • Event handlers' names are more decoupled from their implementation, so that the handlers can be updated or replaced transparently at run time;
  • Repainting and reflow are completely handled by the library;
  • There is a performance gain.

At the moment, there are a couple of chart libraries that work with declarative frameworks:

  • Paths.js is a minimal library by Andrea Ferretti that is, explicitly and by design, oriented to support reactive programming by generating SVG paths that can be used with template engines. It offers three API levels, of increasing abstraction. The lowest level is a chainable API to generate an arbitrary SVG path. On top of this, paths for simple geometric shapes such as polygons or circle sectors are defined. At the highest level, there is an API to generate some simple graphs (such as pie, line chart and radar) for a collection of data, assembling the simple shapes. It can work with any of the frameworks mentioned above and even more, but its ideal partner is probably Ractive.
  • n3-charts is a visualization library specifically designed to be used in conjunction with Angular. At the moment, it offers only line charts, bar charts and pie charts. The API is only high-level but the quality of the graphics is very good.
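
To give a feel for what a layered path-generation API looks like, here is a minimal sketch of the two lower levels described for Paths.js: a chainable builder for arbitrary SVG paths, and a simple shape defined on top of it. The function names are hypothetical and are not the Paths.js API.

```javascript
// Lowest level: a chainable builder that accumulates SVG path commands.
// Hypothetical API for illustration; not taken from Paths.js.
function path() {
  const parts = [];
  const api = {
    moveTo(x, y) { parts.push(`M ${x} ${y}`); return api; },
    lineTo(x, y) { parts.push(`L ${x} ${y}`); return api; },
    close() { parts.push('Z'); return api; },
    toString() { return parts.join(' '); },
  };
  return api;
}

// Second level: a simple geometric shape built on the low-level builder.
function rectanglePath(x, y, width, height) {
  return path()
    .moveTo(x, y)
    .lineTo(x + width, y)
    .lineTo(x + width, y + height)
    .lineTo(x, y + height)
    .close()
    .toString();
}

const d = rectanglePath(0, 0, 10, 20);
```

Because the result is just a string, it can be dropped into the `d` attribute of a `<path>` element inside a template, which is what makes this style a natural fit for template engines.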

However, you don’t really need such a library: Because these frameworks supply most of the infrastructure you might need--as well as the ability to structure charts in a more intuitive way using templates--you can create complex, beautiful and reusable charts using SVG markup.

Of course, a library like Paths allows you to add amazing complete charts to your page with a fraction of the effort. But if you need a customized solution, you can start from scratch using low level SVG elements.
For example, you can easily and smartly iterate over a collection of objects in-line in the HTML and associate SVG elements (bars, lines, circles, etc.) with each data point. Add some CSS-styling, and your MVC-compliant chart is ready.

Differences Between Imperative and Declarative Approaches

The best way to show how a D3-based approach differs from a template-based solution that leverages declarative frameworks is probably to show an example. We will restrict ourselves to a simple one, but more thorough examples can easily be found. (See the conclusions section for some references.)

Our example chart will be a simple bar chart visualizing an array of data (preformatted to be consistent with percentage values, so restricted between 0 and 100) as vertical bars, and showing the corresponding value inside each bar.
To make things a bit more interesting, let's suppose we have structured data divided into segments, and we want to display a chart for each segment. The data itself will therefore be an array of JavaScript objects, each of them with a name property (a string) and a set of fields, also stored as properties, each with a numeric value:

    {
        "name": "Morning",
        "visitors": 22.7,
        "unique_visitors": 42.0,
        "clicks": 12.5
    }

D3 solution
The HTML required with D3 is minimal: we just need a container to which we will add our SVG elements. The JavaScript code, by contrast, is much more interesting:

Some variables, like the function scalePercent, are omitted for the sake of simplicity; others, like svgWidth, drive the dimensions of the chart and can be easily computed from the page dimensions and the number of segments and fields.

The interesting part is that, since we want to keep the chart data-driven and dynamic, we iterate over the fields. They are set in the code here, but they could be extracted from the data or passed as a parameter within each segment. Either way, we need a double loop iterating over the segments and, inside each segment, over the fields to be visualized. To update the chart, we need to call this same function again: we can rebind the data to the chart, but we have to go through the whole process once more.
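
The shape of that double loop can be sketched in plain JavaScript. The example below deliberately does not use D3: it builds SVG markup as a string so that it stays self-contained, whereas a D3 version would use selections and append elements to the container. The function `renderCharts`, the second sample segment, and the layout choices are assumptions made for illustration.

```javascript
const segments = [
  { name: 'Morning', visitors: 22.7, unique_visitors: 42.0, clicks: 12.5 },
  { name: 'Evening', visitors: 60.0, unique_visitors: 35.5, clicks: 80.0 },
];
const fields = ['visitors', 'unique_visitors', 'clicks'];

// Map a percentage (0..100) to a fraction of the drawable height.
function scalePercent(p) {
  return p / 100;
}

// Build one <svg> per segment, and one <rect> plus <text> per field.
function renderCharts(data, fieldNames, svgWidth, svgHeight) {
  const rectWidth = svgWidth / fieldNames.length;
  let markup = '';
  for (const segment of data) {                 // outer loop: segments
    markup += `<svg width="${svgWidth}" height="${svgHeight}">`;
    fieldNames.forEach((field, i) => {          // inner loop: fields
      const h = scalePercent(segment[field]) * svgHeight;
      const x = i * rectWidth;
      const y = svgHeight - h;                  // SVG y grows downwards
      markup += `<rect x="${x}" y="${y}" width="${rectWidth}" height="${h}"/>`;
      markup += `<text x="${x + rectWidth / 2}" y="${y + 15}">${segment[field]}</text>`;
    });
    markup += '</svg>';
  }
  return markup;
}

const html = renderCharts(segments, fields, 90, 100);
```

Note that all of the layout arithmetic lives in the loop body, and that refreshing the chart means running the whole function again with the new data.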

Ractive solution
Let’s contrast the previous code with the kind of solution we can implement using Ractive (or React, Ember, etc.).

The HTML will be the most complicated part this time. In contrast, the JavaScript side is much simpler:
    function displayData(json, containerId, title,
                         chartWidth, chartHeight, chartAreaHeight) {
        // ractiveTemplateId, rectHeight and DEFAULT_BAR_MARGIN are assumed to
        // be defined elsewhere (they are omitted here for simplicity).
        var n = json.length,
            fields = ['visitors', 'unique_visitors', 'clicks'],
            fieldsNbr = fields.length,
            // Computed up front so that later values can depend on them.
            sectionWidth = chartWidth / n,
            rectWidth = sectionWidth * 0.75 / fieldsNbr - 5,
            ractive = new Ractive({
                el: containerId,
                template: ractiveTemplateId,
                data: {
                    segments: json,
                    title: title,
                    fields: fields,
                    tableField: 0,
                    rectWidth: rectWidth,
                    svgHeight: chartAreaHeight,
                    svgWidth: sectionWidth,
                    chartWidth: chartWidth,
                    chartHeight: chartHeight,
                    rectHeight: rectHeight,
                    sectionWidth: sectionWidth,
                    rectMargin: (sectionWidth - rectWidth * fieldsNbr)
                                / (1 + fieldsNbr),
                    rectTopMargin: DEFAULT_BAR_MARGIN,
                    // Helper callable from the template.
                    scalePercent: function scalePercent(p) {
                        return p / 100;
                    }
                }
            });
        return ractive;
    }

To update the chart with new data, we only need one line:
    ractive.set('segments', newData);

But to be thorough, we can enclose the update logic in a function, and animate the refresh:
    function updateChart(newData) {
        // Recompute the per-segment width from the size of the new dataset.
        var n = newData.length,
            svgWidth = ractive.get('chartWidth') / n,
            sectionWidth = svgWidth;
        ractive.animate('svgWidth', svgWidth);
        ractive.animate('sectionWidth', sectionWidth);
        ractive.animate('segments', newData);
    }

One thing worth noticing is that the amount of logic that "contaminates" the presentation can be controlled by the developer. For example, instead of
<text x={{rectMargin + field * (rectMargin + rectWidth) + rectWidth * 0.5}} >
you could use
<text x={{textX(field)}} >
provided you add the textX function to the data of the Ractive object above (as done for the function scalePercent).

This way, only the binding stays in the HTML, and all the logic is back into JavaScript files. Alternatively, you could go the other way, and put more logic inline.
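
As a rough sketch, such a helper could look like the following. The factory `makeTextX` is a hypothetical name, and the margin and width values are made up for the example; the formula is the same one used inline in the template above.

```javascript
// Build the helper from the layout values it depends on.
function makeTextX(rectMargin, rectWidth) {
  // fieldIndex is the zero-based position of the bar within its segment.
  return function textX(fieldIndex) {
    return rectMargin + fieldIndex * (rectMargin + rectWidth) + rectWidth * 0.5;
  };
}

const textX = makeTextX(5, 20);   // assumed margin of 5 and bar width of 20
const firstLabelX = textX(0);     // 5 + 0 * 25 + 10
const secondLabelX = textX(1);    // 5 + 1 * 25 + 10
```

The resulting function would then sit in the data object next to scalePercent, so that the template only has to name it.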

Main Differences, in Summary
It is, certainly, a matter of taste and habit, but you can easily see how template-based solutions look a bit more structured and are easier to read--it is easy to look at the structure of the page and understand its layout and how elements are related to each other.

Another difference is that with Ractive you can easily bind more than one object to the same chart, iterate different subsections of the chart over different composite objects, and control the overall structure at a fine-grained level.

Finally, handling updates and animations on the data is a lot easier with Ractive. You don't need to do anything beyond binding new data to your chart; the framework takes care of the rest, updating only the parts of the DOM that changed.


The main take-aways from this post are:

  1. Your company will most probably need a way to visualize data: customers' data, usage data, log information, and so on.
  2. The best way to build data visualization capabilities for your product depends on the details of your company and product.
  3. There are a number of solutions available to create amazing visualizations with affordable effort.

If you are interested in getting a better grasp, it's a good idea to test first-hand as many of the libraries mentioned in the post as you can, to see which one better fits your taste and needs.

There are also several articles and reports that are must-reads when it comes to data visualization:

A few books address the topic even deeper:

