As we add more workers, we're careful not to reduce performance. How effective are our workers? We measure them, so we know well ahead of time if we need more. Our consistent model is that every worker task exposes analytics endpoints, which report into a StatsD endpoint (DataDog, in our case). This lets operations monitor the health of individual workers and build dashboards that show overall system health.
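As a rough illustration of that model, here is a minimal, dependency-free StatsD client a worker task might use to report its metrics. The wire format (`metric:value|type`) is standard StatsD; the metric names are hypothetical and not from Raygun's actual codebase.

```python
import socket

def statsd_line(metric: str, value: float, metric_type: str) -> str:
    """Format a metric in the plain-text StatsD wire format."""
    return f"{metric}:{value}|{metric_type}"

class StatsdClient:
    """Minimal fire-and-forget UDP StatsD client, no dependencies."""
    def __init__(self, host: str = "127.0.0.1", port: int = 8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def gauge(self, metric: str, value: float) -> None:
        """Report an instantaneous value, e.g. current queue lag."""
        self.sock.sendto(statsd_line(metric, value, "g").encode(), self.addr)

    def incr(self, metric: str, count: int = 1) -> None:
        """Bump a counter, e.g. messages processed."""
        self.sock.sendto(statsd_line(metric, count, "c").encode(), self.addr)

# A worker task might report (metric names are illustrative only):
# client = StatsdClient()
# client.gauge("workers.ingest.queue_lag_ms", 42)
# client.incr("workers.ingest.processed")
```

In practice you would use an off-the-shelf StatsD or DataDog client library, but the wire protocol really is this simple, which is why it is such a low-friction convention to apply to every worker.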
Through consistent monitoring, we understand the capacity of our API. Yours will likely behave similarly to ours: traffic arrives in bursts and follows a business-hours pattern.
Raygun uses RabbitMQ for queuing and DataDog to monitor the capacity of our workers.
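One practical consequence of queue-based workers is that scaling decisions can be driven by queue depth. The sketch below shows the arithmetic, assuming you know roughly how many messages each worker processes per second; the numbers and function name are hypothetical, not Raygun's actual scaling logic.

```python
import math

def workers_needed(queue_depth: int, msgs_per_worker_per_sec: float,
                   drain_target_secs: float) -> int:
    """Workers required to drain the current backlog within the target window."""
    if queue_depth == 0:
        return 0
    required_rate = queue_depth / drain_target_secs  # messages/sec needed overall
    return math.ceil(required_rate / msgs_per_worker_per_sec)

# e.g. a backlog of 12,000 messages, 50 msg/s per worker, drained within a minute:
# workers_needed(12000, 50, 60) -> 4
```

Feeding the queue depth (which RabbitMQ exposes via its management API) into a calculation like this, and comparing it against the current worker count, is a common basis for autoscaling queue consumers.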
Part of effectively scaling horizontally is the ability to replicate and deploy quickly. To do this, one of the first steps we took as a development team was to remove error-prone manual deployments.
To scale efficiently, you need to find templates that work for your development team. Templates are a set of instructions that tell the autoscaling mechanism what to do when a new node starts up.
With autoscaling, the best approach is to update your template whenever you change your API, so that new nodes always start with the current code. Then test the template by using it to cycle nodes: re-deploy to a single node first, verify it, and only then roll out to the rest.
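The cycle-one-node-first idea can be sketched as follows. The user-data script, S3 bucket, and service names here are entirely hypothetical stand-ins for whatever your template actually bootstraps; the point is stamping the current API version into the template and treating the first node as a canary.

```python
USER_DATA_TEMPLATE = """#!/bin/bash
# Bootstrap script baked into the autoscaling template (hypothetical layout).
API_VERSION="{version}"
aws s3 cp "s3://deploys/api-$API_VERSION.tar.gz" /opt/api.tar.gz
tar -xzf /opt/api.tar.gz -C /opt/api
systemctl restart api
"""

def render_user_data(version: str) -> str:
    """Stamp the current API version into the template's startup script."""
    return USER_DATA_TEMPLATE.format(version=version)

def rolling_cycle(node_ids, redeploy):
    """Cycle one canary node first; only continue if it comes back healthy.

    `redeploy` replaces a node from the updated template and returns True
    if the new node passes its health check.
    """
    canary, rest = node_ids[0], node_ids[1:]
    if not redeploy(canary):
        raise RuntimeError(f"canary {canary} failed health check; aborting rollout")
    for node in rest:
        redeploy(node)
```

The key property is that a bad template costs you one node, not the whole fleet.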
Set up alerts
To build a scalable API, you need to know immediately when something goes wrong. Collect and display key metrics from your DevOps tools — the more publicly, the better. "Information Radiators," such as TVs with stats around the office, are a great way to keep system health at the forefront of your mind.
We find that more people pick up on problems this way, especially our engineers who are in the codebase all day long. Spikes are detected very quickly if engineers can access baseline figures — plus they can see the results of any improvements they've made to a piece of code.
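Spike detection against a baseline can be as simple as comparing the latest reading to a moving average. This is a minimal sketch, not Raygun's actual alerting logic; the 2x threshold and 60-sample window are illustrative defaults you would tune.

```python
def rolling_baseline(samples, window: int = 60) -> float:
    """Simple moving average over the last `window` samples."""
    recent = samples[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def is_spike(current: float, baseline: float, threshold: float = 2.0) -> bool:
    """Flag when the current reading exceeds the baseline by `threshold` times."""
    return baseline > 0 and current > baseline * threshold

# e.g. requests/sec holding steady around 100, then a burst to 250:
# is_spike(250, rolling_baseline([100] * 60)) -> True
```

Monitoring platforms like DataDog offer far more robust anomaly detection, but having engineers internalize the baseline figures, as described above, is what makes the dashboards useful.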
Here at Raygun, we recognize that taking a proactive approach to understanding the traffic coming to our API is key for horizontal scaling. We use Crash Reporting to identify and raise problems in our code into a dashboard that is accessible by everyone on the development team. We also use Crash Reporting to monitor for errors in our API specifically and collect data on timings. We put a lot of effort into monitoring custom metrics, such as failure per application, so if people send us bad API keys, we can understand where and why traffic is being rejected.
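Tracking a custom metric like "failures per application" mostly amounts to counting rejections keyed by API key and reason. A minimal sketch, using only the standard library; the class and reason strings are hypothetical:

```python
from collections import Counter

class RejectionTracker:
    """Count rejected requests per (API key, reason) so you can see
    where and why traffic is being rejected."""
    def __init__(self):
        self.rejections = Counter()

    def record(self, api_key: str, reason: str) -> None:
        self.rejections[(api_key, reason)] += 1

    def top_offenders(self, n: int = 5):
        """The n most frequent (api_key, reason) pairs, highest first."""
        return self.rejections.most_common(n)
```

In a real system you would flush these counters to StatsD/DataDog rather than hold them in memory, but the shape of the data, a count per key and rejection reason, is the same.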
We've found that thorough software testing, using both scale testing and production testing, is critical to maintaining quality code and ensuring our API is robust and able to scale when necessary. Here's a brief breakdown of how we use scale and production testing to keep the API robust and create a better experience for our customers.
Scale testing is a strategy we employ at Raygun, so we know exactly where bottlenecks are. This strategy means we can cater to larger customers with no nasty surprises. To test what our API can handle, we run regular load tests, looking for our upper limit.
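Searching for the upper limit typically means ramping the request rate until latency breaches your SLA. The sketch below shows that ramp as plain logic; `measure_latency_ms` stands in for an actual load-test run (with a tool such as k6 or Gatling), and the 500 ms SLA is an illustrative number, not Raygun's.

```python
def find_upper_limit(measure_latency_ms, rates, sla_ms: float = 500.0):
    """Ramp through increasing request rates; return the last rate that
    stayed within the latency SLA, or None if even the first rate failed.

    measure_latency_ms(rate) runs one load-test stage at `rate` requests/sec
    and returns the observed latency (e.g. p99) in milliseconds.
    """
    last_good = None
    for rate in rates:
        if measure_latency_ms(rate) > sla_ms:
            break
        last_good = rate
    return last_good
```

Running this regularly, rather than once, is what turns it into the "no nasty surprises" strategy: the upper limit is a number you track over time, not a one-off discovery.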
While developing an API for any business, you should keep testing for this upper limit so you can continually push your boundaries and grow. That is the process of scaling, and it should evolve into autoscaling.
Anything in your system can fail at any time — just make sure you find it before your users do. At Raygun, we also test thoroughly in production. We operate under the assumption there will always be problems and software bugs, but we have visibility on problems in production with Crash Reporting tools. This acceptance of software problems leaves teams much better prepared for reacting and resolving errors faster.
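The "assume there will always be bugs" stance usually shows up in code as an exception boundary that reports crashes before letting them propagate. A minimal sketch; `reporter` here is any callable, whereas in production it would be a crash-reporting client (such as a Raygun provider) posting to a service:

```python
import functools
import traceback

def report_crashes(reporter):
    """Decorator: send unhandled exceptions to a crash reporter, then re-raise.

    `reporter` receives the formatted traceback as a string. Re-raising
    matters: reporting gives you visibility, but it must not swallow errors.
    """
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                reporter(traceback.format_exc())
                raise
        return wrapper
    return decorate
```

Wrapping request handlers or worker entry points this way is what gives a team visibility on production problems without pretending those problems won't happen.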
A prevailing attitude at Raygun that has helped us to build and scale is our strategy to make incremental changes to the code so we can react quickly and roll back if necessary.
We have a "fail early and fail fast" approach, which allows us to move and scale up with our business goals. Remember, allow for failure and never get caught off-guard.
The key to scaling your software is locating bottlenecks before your users do, and often your API is what provides the biggest restrictions. Scale your API, and scale your business.