We also have statsd metrics for test suites and individual test cases. This helps us understand which test cases fail the most, which ones are slow, and so on. Finally, there's a Slack reporter that sends messages about failed runs to a specified room.
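To make the per-test-case metrics concrete, here is a minimal sketch of what emitting them could look like. It uses the plain statsd line format (`name:value|type`) over UDP. The metric naming scheme (`monitor.<suite>.<case>...`) and the function names are assumptions for illustration, not the names our service actually uses.

```python
import socket
from typing import List


def build_case_metrics(suite: str, case: str, passed: bool, duration_ms: float) -> List[str]:
    """Build statsd lines for one test case: an outcome counter and a timer."""
    prefix = f"monitor.{suite}.{case}"  # hypothetical naming scheme
    outcome = "passed" if passed else "failed"
    return [
        f"{prefix}.{outcome}:1|c",                  # counter: one more run with this outcome
        f"{prefix}.duration:{duration_ms:.0f}|ms",  # timer: how long the case took
    ]


def send_metrics(lines: List[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send each metric as its own UDP datagram (fire-and-forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

With counters split by outcome and a timer per case, a dashboard can rank test cases by failure count or by latency without any extra bookkeeping in the monitor itself.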
- Cron-like schedule definition
- Simple to run locally
- API endpoints to interact with the service (pause it, list suites, etc.)
- Fine-grained logs that make runs easy to follow
HTML test report:
The first thing that happened when we put our initial automated setup in place was that we saw a lot of problems. Problems everywhere! We found bugs in our own components, and plenty of issues with the services we depend on, including third-party ones: timeouts, 5xx responses, incorrect status codes, etc. We also realized we had put too much trust in the network, and that our alerts were very sensitive and hence noisy.
Luckily, we found these things before our API went into the first beta stage, so we were able to fix all the issues before offering this product to our customers.
However, this experience prompted us to make our components more resilient to intermittent issues. Retry mechanisms are great for this. We also tweaked the monitor service's timeouts, and made it trigger OpsGenie alerts only if an error happened twice in a row.
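The two ideas in this paragraph, retrying intermittent failures and alerting only on repeated errors, can be sketched roughly as follows. This is an illustration under assumptions, not our actual implementation: `retry`, `AlertGate`, and their parameters are hypothetical names, and the exponential backoff is just one reasonable choice of retry delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class AlertGate:
    """Suppress alerts until a suite has failed `threshold` runs in a row."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._streaks = {}  # suite name -> current count of consecutive failures

    def record(self, suite: str, passed: bool) -> bool:
        """Record a run result; return True when an alert should fire."""
        if passed:
            self._streaks[suite] = 0
            return False
        self._streaks[suite] = self._streaks.get(suite, 0) + 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self._streaks[suite] == self.threshold
```

A single failed run returns `False` from `record`, so a one-off network blip never pages anyone. Only the second consecutive failure triggers the alert, and a passing run resets the streak.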
We have been running the latest incarnation of this setup in production since early 2017 with good results. We've made our system more stable and self-healing, reducing the number of incidents drastically. We've also notified maintainers of services we depend on about bugs, performance issues and other surprises. We still repeatedly find out about third-party (or internal) service outages before they are announced! When this happens, we can quickly make sure our customers are aware of the problem by updating our status page. Bad deployments are caught very quickly. In one instance, our streaming API started pushing double-encoded JSON payloads. The monitor service caught it, and we rolled back, added a test case, fixed the bug, and redeployed within 10 minutes.
As for the implementation, we are very happy that we took the route we did. Starting with something manual and then gradually automating that solution helped immensely. So if your team has nothing like this in place yet, you could try out Postman as a quick way to get started.
Finally, the possibilities for a monitor service like this are endless: you can automatically close alerts if a failing test case starts passing, automatically update your status page, notify other teams, maybe map failed runs to specific deployments, do automatic rollbacks, and the list goes on. Synthetic monitoring is an integral part of our system now, and a huge confidence booster.
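As an example of the first idea, closing an alert automatically once a failing test case passes again, the bookkeeping could look something like this minimal sketch. `AlertTracker` and its `"open"`/`"close"` return values are hypothetical. The caller would translate them into calls to whatever alerting tool is in use.

```python
from typing import Optional


class AlertTracker:
    """Track which test cases have an open alert and decide when to open or close one."""

    def __init__(self) -> None:
        self._open = set()  # names of test cases with an alert currently open

    def on_result(self, case: str, passed: bool) -> Optional[str]:
        """Return "open", "close", or None for the latest result of `case`."""
        if not passed and case not in self._open:
            self._open.add(case)
            return "open"   # first failure: raise an alert
        if passed and case in self._open:
            self._open.remove(case)
            return "close"  # recovered: the open alert can be closed
        return None         # no state change, nothing to do
```

Because the tracker only reports state changes, repeated failures do not reopen an alert that is already open, and repeated passes do not produce redundant close calls.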
Are you using this technique as well? What are your experiences? What tools do you use? We'd love to hear about your setup and learnings, and answer any questions in the comments below.