Retries Can Kill You

In a large-scale distributed system, it's inevitable that some requests will fail. Even if your collaborating systems work perfectly, sooner or later you will experience temporary network issues and other intermittent errors. That's why a lot of people try to paper over this issue by implementing retries in their applications. Unfortunately, if it's not done properly, this can cause serious stability problems.

Let's assume for a minute that your requests are idempotent so they can be safely repeated without causing duplicate orders, money being deducted twice, or similar. Let's further assume that someone implemented a naive strategy of retrying a failed request two times. What could possibly go wrong?

The problem is that once your collaborating system starts to struggle for some reason, e.g. because it is operating at capacity, the retries will push it over the edge. Basically, you're mounting a **denial-of-service attack** on yourself.
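To put a rough number on that, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every attempt fails independently with the same probability; the function name `expected_attempts` is mine. In a real overload the failure rate climbs as load increases, so the actual effect tends to be worse than this model suggests.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of attempts sent per logical request.

    Assumption: each attempt fails independently with probability
    `failure_rate`, and the client gives up after `max_retries` retries.
    """
    # Attempt k (0-based) is only sent if all k previous attempts failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


healthy = expected_attempts(failure_rate=0.0, max_retries=2)   # 1.0
degraded = expected_attempts(failure_rate=0.5, max_retries=2)  # 1.75
down = expected_attempts(failure_rate=1.0, max_retries=2)      # 3.0
```

So with the naive "retry twice" policy, a backend that is already failing every request receives three times its normal traffic, exactly when it can least handle it.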

Most senior engineers know about this and design their systems to fail fast, which is a well-known stability pattern (see Nygard's Release It!). But sometimes, retries can sneak up on you, for example when running a service behind a reverse proxy. I once spent a lot of time trying to figure out where the huge number of requests that killed a service was coming from. Only after I finally obtained access to the proxy configuration did I see that someone had configured a 10-times retry policy.

Still, I think retries are a good thing if applied carefully. The techniques I've written about for preventing cascading failure apply here as well: take control of how much traffic your clients are sending to your servers. However, you can also handle this as a special case using retry budgets. You could, for example, allow only N retries per second, which is easy to implement using a rate limiter, or make the budget a percentage of the successful requests you're sending.
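As a minimal sketch of the "N retries per second" variant, the following uses a simple token bucket as the rate limiter. The names `RetryBudget`, `try_acquire`, and `call_with_budget` are hypothetical, the code is not thread-safe, and it retries on any exception, which a real client would want to narrow down to retryable errors.

```python
import time


class RetryBudget:
    """Token bucket allowing `retries_per_second` retries on average.

    `burst` caps how many retries can be spent at once. Not thread-safe;
    a shared budget in a threaded client would need a lock.
    """

    def __init__(self, retries_per_second: float, burst: float, clock=time.monotonic):
        self.rate = retries_per_second
        self.capacity = burst
        self.tokens = burst
        self.clock = clock  # injectable so the budget can be tested without sleeping
        self.last = clock()

    def try_acquire(self) -> bool:
        """Spend one retry token if available; never blocks."""
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(request, budget: RetryBudget, max_retries: int = 2):
    """Call `request()`; retry on failure only while the shared budget allows it."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # Fail fast once the per-call limit or the shared budget is exhausted.
            if attempt == max_retries or not budget.try_acquire():
                raise
```

The important design choice is that the budget is shared across all calls to the same backend: when that backend degrades, the total retry traffic stays bounded by the budget no matter how many individual requests fail. The percentage-based variant could reuse the same structure by adding a fraction of a token for each successful request instead of refilling by elapsed time.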

In any case, I highly recommend monitoring and alerting on your retry mechanisms and regularly reviewing the reasons for those retries.