This guest post comes from Chad R. Smith, founder of The Easy API. He has worked with developers all over the world, assisting with architecture and functionality for high-availability APIs.
Asking API maintainers the simple question "How scalable is your API?" tends to produce some awkward pauses. The Easy API recently discovered how poorly equipped we were to handle a massive influx of requests to our system. It quickly became evident that the system could not handle more than a million requests a month, and it failed under heavy load. This article discusses the programming, server, and monitoring changes that brought The Easy API back online and took it to the next level. The techniques discussed played a critical role in helping The Easy API scale past a million requests a month, a number that is still growing rapidly.
The beginning of 2011 marked a major milestone for The Easy API. It had been one year since its inception, and it was picking up a lot of users and requests. The original application was built on top of the CakePHP framework, and the server was in Rackspace's cloud (Mosso). During December we signed one of the largest SEO companies in the world, which started to use our services. As their product took shape and they used the system more heavily, the server began to fail under the load. In the short term we decided that it would be best to upgrade the server and make the following changes.
The first change, from a product perspective, was to stop using a traditional PHP framework. Instead we decided to create our own classes and roll our own framework. This allowed for greater flexibility and lower memory consumption, because the code loads only what is needed to serve the request at hand. Using best-practice object-oriented PHP 5 code, we developed a class system that was both efficient and gave us greater control over security. We implemented methods that don’t touch the database unless absolutely necessary. These were things that should have been in place all along but were either not possible or poorly executed in the previous CakePHP version of the API. The rewrite also let us grow from a POST XML style API into one that supports POST XML, JSON, JSONP, and REST.
We listened to our customers when making the changes and implemented an advanced throttling system that limits the number of processes running at any one time for any one user. This class-based system ensures “fair usage” among users. We implemented memcache where needed to speed up processing and reduce our dependency on the database. Instead of reaching the database through its external-facing IP address, we used its internal address, which exists only on the Mosso network. This allowed for gigabit transfer speeds between the two servers.
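The per-user throttling idea can be sketched in a few lines. This is only an illustration written in Python, not the PHP class The Easy API uses; the names `UserThrottle` and `ThrottleExceeded` and the limit of 5 are assumptions made for the example.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class ThrottleExceeded(Exception):
    """Raised when a user already has the maximum number of requests in flight."""


class UserThrottle:
    """Caps how many requests may run at the same time for any one user."""

    def __init__(self, max_concurrent=5):  # the limit of 5 is an assumed value
        self.max_concurrent = max_concurrent
        self._running = defaultdict(int)
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, user_id):
        # Claim a slot, or refuse immediately if the user is at the limit.
        with self._lock:
            if self._running[user_id] >= self.max_concurrent:
                raise ThrottleExceeded(user_id)
            self._running[user_id] += 1
        try:
            yield
        finally:
            # Always release the slot, even if the request handler raised.
            with self._lock:
                self._running[user_id] -= 1
                if self._running[user_id] == 0:
                    del self._running[user_id]
```

A request handler would wrap its work in `with throttle.slot(user_id):` and translate `ThrottleExceeded` into whatever "try again later" response the API defines. Because each user has a separate counter, one heavy user cannot take slots away from another.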
With the new API we added new input and output methods, throttling, and caching where it was needed. The programming behind the system makes it easier to add services and maintain a high level of reliability, thanks to the error reporting that’s now built in.
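Caching to avoid database hits usually follows a "check the cache first, load on a miss" pattern. The sketch below shows that pattern in Python with a small in-process dictionary standing in for memcache; `TTLCache`, `cached_lookup`, and the 60 second expiry are hypothetical names and values chosen for illustration, not part of The Easy API.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcache client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # The entry has expired, so treat it as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_lookup(cache, key, load_from_db, ttl_seconds=60):
    """Return the cached value, calling load_from_db only on a miss.

    A loader result of None is never served from the cache in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value, ttl_seconds)
    return value
```

Repeated lookups for the same key within the expiry window never reach `load_from_db`, which is the reduction in database dependency described above.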
Originally the website, API, and database all ran on one small server with 256MB of RAM. The server ran Ubuntu 9.04, Apache2, PHP5, and MySQL 5.1, with the CakePHP framework. When the heavy load started, the initial course of action was to upgrade the server to 1GB of RAM.
After development was complete on the second version of the API, we built a server architecture that would scale as our customers’ needs scaled. We moved to a distributed architecture, with the database, website, and API each on its own server. The website is still on Apache2, PHP5, and CakePHP. Since it no longer handles API requests, that server was downgraded to its original RAM configuration.
The API server initially ran Nginx and PHP-FPM, and it ran fine for the first couple of days until we noticed an influx of 500 Internal Server Error responses. After some troubleshooting we determined that when multiple requests from the same IP address arrived within a few milliseconds of each other, the server would reject the request with a 500 error. Once this came to light we decided to look into alternatives to Nginx, as it wasn’t right for our needs.
The first alternative we tested was Lighttpd, and once again we were very disappointed with the server’s performance. It just didn’t provide the flexibility and efficiency that our customers required. We spoke with experts who set up these servers on a regular basis, and what they described was similar to the experience we had with Nginx. It was great for caching and static content, but an API is not static, nor do you want it to be cached.
The next server we tried was Cherokee, which is developed by Alvaro Lopez Ortega. We immediately fell in love with it: it was easy to set up and required very little memory to run. In our staging environment we saw significant memory savings that Nginx and Lighttpd just couldn’t offer. The server ran more efficiently than the others we tested, and we brought it into production in the middle of February. That month alone we served over 1.2 million requests, our highest month to date.
Even with our new API we found memory leaks that could not be avoided. This is the primary reason we decided to use PHP-FPM, and if your API is built on PHP you might want to consider switching to PHP-FPM (or another FastCGI-based PHP implementation). It lets us control how many worker processes are running and how many requests each worker handles before the parent process kills and restarts it. The restart releases the leaked memory back to the system, where the HTTP server can use it.
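To see why recycling workers keeps a leak in check, here is a toy model in Python. It is not PHP-FPM itself, and the base size and leak size are invented numbers; in a real PHP-FPM pool the corresponding settings are `pm.max_children` (how many workers) and `pm.max_requests` (how many requests a worker serves before it is respawned).

```python
class Worker:
    """A worker whose memory grows a little on every request (a simulated leak)."""

    def __init__(self, base_kb=1000, leak_kb=10):  # both sizes are made up
        self.memory_kb = base_kb
        self.leak_kb = leak_kb
        self.handled = 0

    def handle(self):
        self.handled += 1
        self.memory_kb += self.leak_kb


def peak_memory_kb(total_requests, max_requests=None):
    """Serve requests with one worker slot, recycling it after max_requests.

    max_requests=None means the worker is never restarted.
    """
    worker = Worker()
    peak = worker.memory_kb
    for _ in range(total_requests):
        worker.handle()
        peak = max(peak, worker.memory_kb)
        if max_requests is not None and worker.handled >= max_requests:
            # The parent kills the old process and starts a fresh one.
            worker = Worker()
    return peak


print(peak_memory_kb(1000))                    # never recycled
print(peak_memory_kb(1000, max_requests=100))  # recycled every 100 requests
```

In this model the unrecycled worker's memory grows with every request it has ever served, while the recycled worker's peak is bounded by the leak accumulated over 100 requests. The trade-off is the cost of starting a new process, so the request limit is something to tune against your own leak rate.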
Load Testing and Server Monitoring
How do you actually test the throughput your API can handle? That was the number one question we asked ourselves, and we finally found an answer. Pylot is an open source performance and scalability testing tool. It’s written in Python and generates concurrent HTTP requests to the server you define using multiple "agents." It verifies server responses and produces reports with metrics from the agents’ responses. It can help you track down server 500 errors or memory leaks that creep into everyday code. This makes it possible to troubleshoot and test memory consumption at a far larger scale than manual testing allows.
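The idea of concurrent agents is simple enough to sketch with the Python standard library. The code below is not Pylot and does not use its configuration format; it is a minimal stand-in, with hypothetical function names, that shows what several agents sending requests at once and tallying the responses looks like.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def http_status(url, timeout=10):
    """Issue one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def run_agent(fetch, requests_per_agent):
    """One agent: send requests back to back, recording status and latency."""
    results = []
    for _ in range(requests_per_agent):
        started = time.perf_counter()
        status = fetch()
        results.append((status, time.perf_counter() - started))
    return results


def run_load_test(fetch, agents=10, requests_per_agent=20):
    """Run several agents concurrently and summarise their responses."""
    with ThreadPoolExecutor(max_workers=agents) as pool:
        futures = [pool.submit(run_agent, fetch, requests_per_agent)
                   for _ in range(agents)]
        results = [r for f in futures for r in f.result()]
    latencies = [elapsed for _, elapsed in results]
    return {
        "requests": len(results),
        "statuses": Counter(status for status, _ in results),
        "avg_seconds": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_seconds": max(latencies, default=0.0),
    }
```

Against a staging server you might call `run_load_test(lambda: http_status("http://localhost:8080/ping"))`, where the URL is a placeholder for your own endpoint. A rising count under `statuses[500]` as you add agents is the kind of signal that exposed the problem described earlier. Connection failures are not caught here and will stop the run.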
After our ordeal we sought a secure, reliable, and feature-rich monitoring suite that would tell us how our servers are doing and what issues they are encountering. We contacted a few vendors but ultimately decided on ScienceLogic’s EM7 G3 system. We point the G3 product at the IP addresses of our servers and it runs a full-time monitoring process. It’s a rack-mounted system that we keep at a co-location facility. It came with dynamic applications already developed to handle our server types and monitor them correctly. This lets us monitor not only our servers but also our applications, using advanced monitoring policies that we can set up ourselves. Furthermore, we set up policies that read the server logs, count the 500 errors, and alert us when there are issues. We set thresholds on everything, down to the amount of bandwidth consumed in a given 5 minute window, so we get an alert when a server needs to be watched closely. The staff was very helpful during the deployment, and though it’s not the right solution for every API out there, for our business it’s critical that we are operating at peak efficiency.
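If you don't have a monitoring appliance, the log-based 500 check can be approximated with a short script. The sketch below is not how EM7 works internally; it assumes your access log uses the common or combined log format, and the function names and threshold are illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp and status code of a common/combined log format line.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')
WINDOW_SECONDS = 300  # a 5 minute window


def count_500s_per_window(lines):
    """Return {window_start_epoch: number of 500 responses in that window}."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "500":
            continue  # skip unparseable lines and non-500 responses
        when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        epoch = int(when.timestamp())
        counts[epoch - epoch % WINDOW_SECONDS] += 1
    return counts


def windows_over_threshold(counts, threshold):
    """Return the window start times whose 500 count exceeds the threshold."""
    return sorted(start for start, n in counts.items() if n > threshold)
```

Feeding it an open log file, for example `count_500s_per_window(open(path))` with `path` pointing at your own access log, and passing the result to `windows_over_threshold` gives a list of 5 minute windows worth an alert. The same bucketing approach works for bytes transferred if you sum the response size field instead of counting lines.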