Limiting API usage is a standard technique for avoiding server or database overload at critical times, but it is not always easy to do without annoying API clients. Ben Weintraub, writing on the New Relic blog, explains the techniques the analytics company developed to limit resource usage, and how the team dealt with a mysterious issue that threatened to overload its API.
One of the first things the team did to prevent API overload was to create a separate pool of hardware for the API, distinct from the pool serving the UI. The API pool runs the Unicorn web server, which forks worker processes that each handle only one request at a time. This gives strong fault isolation, with the disadvantage that you need N workers to handle N concurrent requests. Having separate resources for the API, however, meant that API clients could never commandeer the workers serving the UI.
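The one-request-per-worker constraint can be illustrated with a minimal Python sketch. This is not New Relic's actual stack: it uses a thread pool in place of Unicorn's forked processes, and the pool size and request duration are arbitrary. The point is the capacity math: with N workers, the (N+1)th concurrent request has to wait for a worker to free up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 3  # hypothetical pool size

def handle_request(request_id):
    # Each worker is occupied for the full duration of one request.
    time.sleep(0.1)
    return request_id

start = time.monotonic()
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    # Submit one more concurrent request than there are workers.
    results = list(pool.map(handle_request, range(N_WORKERS + 1)))
elapsed = time.monotonic() - start

# With 3 workers and 4 requests of 0.1 s each, the 4th request must
# wait for a free worker, so total time is roughly 0.2 s, not 0.1 s.
print(f"handled {len(results)} requests in {elapsed:.2f}s")
```

This is why a slow client holding a worker is so costly in this model: the worker can do nothing else until that one request completes.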
Despite this work, it was still possible for an API client to break the UI by straining shared dependencies. To ward off this danger, New Relic created API Overload Protection, a resource-limiting tool designed to track the Unicorn worker time used by each account. The tool automatically restricts a client's API access when its worker time exceeds a set threshold. Overload Protection was built to run separately from the application so it would keep working even when the application itself was in trouble.
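The accounting logic described above can be sketched in a few lines of Python. This is a hypothetical simplification, not New Relic's code: the class name, the per-window budget, and the account IDs are all illustrative, and a real implementation would also reset usage per time window and persist state outside the app.

```python
from collections import defaultdict

class OverloadProtection:
    """Hypothetical sketch of per-account worker-time limiting."""

    def __init__(self, max_seconds_per_window):
        self.max_seconds = max_seconds_per_window
        self.used = defaultdict(float)  # account_id -> worker seconds

    def record(self, account_id, worker_seconds):
        # Accumulate Unicorn worker time consumed by each account.
        self.used[account_id] += worker_seconds

    def allowed(self, account_id):
        # Restrict API access once an account exceeds its budget.
        return self.used[account_id] < self.max_seconds

guard = OverloadProtection(max_seconds_per_window=60.0)
guard.record("acct-1", 45.0)
guard.record("acct-1", 20.0)          # now over the 60 s budget
print(guard.allowed("acct-1"))        # restricted
print(guard.allowed("acct-2"))        # other accounts unaffected
```

Note that a scheme like this is only as good as the worker-time measurements fed into `record` — which is exactly where the mystery in the next section arises.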
This is where things got weird. The tool should have cut off API access for clients with excessive usage, but it didn't. For some clients, the internal transaction data from the API showed far more worker time than Overload Protection was recording. To figure out what was going on, the team turned to the nginx access logs.
The nginx response times should have been tightly correlated with worker times, since the nginx server spends most of its time simply waiting for a response from Unicorn. That wasn't the case for requests lasting four to five seconds. These requests were often logged with status 499, the code nginx uses when the client closes the connection before the server has responded. Looking at the corresponding transaction events, the team noticed that a request could apparently last 30 seconds even when nginx reported a response time of just over four seconds. Digging deeper, they found that nginx had not yet received a response from Unicorn when it cut off the request.
All of these requests turned out to come from a user agent named 'New Relic Exporter'. A search on GitHub traced the name to an open-source tool for exporting data from New Relic, which had a client-side timeout of five seconds. The culprit was client-side timeouts: because nginx finalized the request as soon as the client disconnected, the worker time Unicorn went on to spend never made it into the accounting.
To fix this, the team set proxy_ignore_client_abort in the nginx configuration. This directive instructs nginx to defer post-processing work until Unicorn returns a response, rather than doing it when the client closes the connection. This ensured that Overload Protection would get an accurate reading of Unicorn worker time for each account.
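In configuration terms, the change looks something like the fragment below. The upstream name and location path are hypothetical; only the proxy_ignore_client_abort directive is the fix described in the article.

```nginx
# Illustrative nginx location block (upstream and path are made up).
location /api/ {
    proxy_pass http://unicorn_upstream;
    # Keep the proxied request running until Unicorn responds,
    # even if the API client closes its connection first.
    proxy_ignore_client_abort on;
}
```

With this in place, nginx's logged timings reflect the full time Unicorn spent on the request, regardless of when the client gave up.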