How and Why Meltwater Rolled Its Own Custom Testing Solution For Its API

Haven't we all been a bit nervous at times about pressing that "Deploy" button, even with amazing test coverage? Especially in the scary world of microservices (aka distributed systems), where the exact constellation of services and their versions is hard to predict.

In this post I will introduce the Synthetic Monitoring concept, which can make you feel more confident about your production system because you find errors more quickly. It focuses on improving your Mean Time to Recovery (MTTR) and on gathering valuable metrics about how your whole system behaves.

The idea of Synthetic Monitoring is that you monitor your application in production based on how an actual user interacts with it. You can picture it as a watchdog that checks that your system is behaving the way it is supposed to. If something is wrong, you get notified and can fix the problem. Our team arrived at this technique without knowing that it was an already established concept.

Test suites are a great way to prevent errors from reaching production, improving the Mean Time Between Failures (MTBF). However, even with solid test coverage, things can go wrong at go-live: something may be misconfigured, a gateway may be down, one of the services we depend on may have issues, or a scenario may simply lack test coverage. When something does go wrong in production, we want a low MTTR so that user impact is kept to a minimum.

Our Problem

The Meltwater API is composed of many different components, which may be deployed multiple times a day. Our components also depend on other internal and external services. Including all of these services in our local or CI tests is slow and brittle at best, and in some cases not feasible at all.

This realization prompted us to develop a set of contract tests that verify that the API behaves as specified in our documentation, so that the users of our APIs actually get what they expect. If any of these checks fail, we want to be notified about it.


Approach 1: Postman

We started by creating a collection of HTTP requests for our external-facing endpoints in Postman. Postman is a tool for API development: you can run HTTP requests, create test cases based on them, and gather those test cases into collections. After every deployment we manually ran all of the Postman tests against production to detect any signs of smoke. If something failed, we could decide to either roll back or roll forward.
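For illustration, a collection entry with an attached test script might look roughly like this (the endpoint, variable names, and assertions are made up for this example, not taken from our actual collection):

```json
{
  "name": "List searches",
  "request": {
    "method": "GET",
    "url": "{{base_url}}/searches"
  },
  "event": [
    {
      "listen": "test",
      "script": {
        "exec": [
          "pm.test('status code is 200', function () {",
          "    pm.response.to.have.status(200);",
          "});"
        ]
      }
    }
  ]
}
```

The `script.exec` lines are JavaScript that Postman runs in its sandbox after the response comes back, using the built-in `pm.test` and `pm.response` APIs.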

Even though we moved away from this approach quickly, we still sometimes use this Postman collection for manual testing.


Approach 2: Newman

The manual approach quickly got old as we increased the number of deployments per day. Newman to the rescue! Newman is a command-line interface and library for Postman (see Run and Test Postman Collections Directly from the Command Line). We built a small service around Newman that runs our Postman collection every five minutes and notifies us via OpsGenie (the incident management software we use) in case of failure.
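Conceptually, the wrapper service just runs something like the following on a schedule (the file names here are illustrative, not our actual files):

```shell
# Run the collection headlessly; newman exits non-zero if any test fails,
# which is the signal the wrapper service reacts to.
newman run meltwater-api.postman_collection.json \
  --environment production.postman_environment.json \
  --reporters cli,json \
  --reporter-json-export latest-run.json
```

The exported JSON report gives the service structured results to inspect before deciding whether to raise an alert.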

Approach 3: A Custom Monitor-service

The Postman+Newman combination eventually surfaced some annoyances and shortcomings for our team as we relied on it more and more:

  • Writing test cases in JSON is limiting, hard to read, and does not give you much flexibility for sharing functionality between tests.
  • We wanted more and better instrumentation and reporting capabilities.
  • We wanted to broaden the coverage of the monitoring a little bit, introducing a small set of more whitebox-style tests to detect other types of problems. Those tests need to do more than just make HTTP requests.

So we rolled our own solution. The good thing about having gone from a manual to an automated solution beforehand was that we knew exactly what we wanted.

Defining test cases

We wanted to write tests in a real programming language. This way you get the full power and flexibility of the language. You get help from your editor and the compiler. You can share code between tests. Below is an Elixir snippet from a test case that makes sure it's possible to create a search:

defmodule MonitorService.Case.CreateSearch do
  @moduledoc """
  Verify that we can create a new search.
  """

  use MonitorService.Case

  def test(context) do
    now = DateTime.to_iso8601(DateTime.utc_now())
    name = "#{name_prefix()}_#{now}"
    search = create_search(name, context.user_key, context.access_token)
    assert search["name"] == name
  end
end

By using Elixir's built-in ExUnit assertions, we get the benefit of helpful, informative error messages when a check fails.
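The post does not show `MonitorService.Case` itself, but a minimal sketch of such a shared case template might look like this (the module body and the `name_prefix/0` helper are assumptions for illustration, not Meltwater's actual code):

```elixir
defmodule MonitorService.Case do
  # Sketch of a shared case template (assumed, not from the post).
  # `use MonitorService.Case` pulls ExUnit's assertion macros into each
  # test case, so `assert` works outside a regular ExUnit test suite and
  # still raises descriptive ExUnit.AssertionError failures.
  defmacro __using__(_opts) do
    quote do
      import ExUnit.Assertions

      # Hypothetical helper: a prefix that marks data created by the
      # monitor, so it is easy to recognise and clean up afterwards.
      def name_prefix, do: "monitor_service"
    end
  end
end
```

Shared HTTP helpers (such as `create_search/3` above) could live in the same template or in modules it imports, which is exactly the kind of code reuse the JSON-based Postman tests could not give us.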

Report output

We knew we wanted different types of reports based on the results of the test runs. OpsGenie alerts were the minimum requirement, so that we could notify the engineer on call. HTML reports have proven useful for getting a first rough idea of what's wrong.
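As a sketch, raising an alert from the monitor could look like the following. This assumes the HTTPoison and Jason libraries and OpsGenie's v2 Alerts endpoint; the module and function names are invented for this example:

```elixir
defmodule MonitorService.Report.OpsGenie do
  # Sketch only: create an OpsGenie alert for a failed monitor case.
  # POST /v2/alerts with "GenieKey" authorization is OpsGenie's documented
  # Alert API; HTTPoison and Jason are assumed dependencies.
  @url "https://api.opsgenie.com/v2/alerts"

  def alert(api_key, case_name, reason) do
    body =
      Jason.encode!(%{
        message: "Monitor case failed: #{case_name}",
        description: reason,
        priority: "P2"
      })

    headers = [
      {"Authorization", "GenieKey #{api_key}"},
      {"Content-Type", "application/json"}
    ]

    HTTPoison.post(@url, body, headers)
  end
end
```

A reporter module like this can be swapped out or extended, which made it straightforward to add the HTML reports alongside the alerts.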
