Timeouts, retries with backoff, jitter
Notes on a really great article from the AWS Builders' Library about making service-to-service requests more reliable, and the tradeoffs involved!
First ask: is this request retryable? The work must be idempotent!
Timeouts
- Without thoughtful timeouts, clients can wait for long periods, tying up limited server resources (e.g. request threads, of which there are often vanishingly few) for a response that might never come back (it's hard to tell the difference between slow and down)
- The article talks about setting a reasonable timeout using latency percentiles, e.g. the 99.9th. This forces developers to ask how many false-positive timeouts are acceptable so that we can set a timeout that is reasonable for an endpoint (sketch below)
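A minimal sketch of deriving a timeout from observed latencies. The function name and sample data are mine, not the article's, and in practice you'd want far more samples than this for a meaningful p99.9:

```python
import statistics

def timeout_from_percentile(latencies_s: list[float]) -> float:
    """Pick the 99.9th percentile of observed latencies as the timeout.

    By construction, roughly 0.1% of healthy requests will be cut off
    as false-positive timeouts; that is the tradeoff being accepted.
    """
    # quantiles(n=1000) yields 999 cut points; the last one is p99.9
    return statistics.quantiles(latencies_s, n=1000)[-1]

# e.g. latencies sampled from production metrics (illustrative values)
latencies = [0.05, 0.06, 0.05, 0.07, 0.21, 0.06, 0.05, 0.08]
print(timeout_from_percentile(latencies))
# then something like: requests.get(url, timeout=timeout_from_percentile(latencies))
```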
Retries
- Retrying is selfish: it says your request is worth tying up resources for repeatedly until it succeeds
- Have to be careful here
- Did a request fail because of load? If yes, retrying might prolong a bad situation
    - Did it fail with a client error (4xx)? Don't retry, because it will never succeed
    - Is it part of a larger batch of work that becomes a thundering herd, retrying in lockstep and prolonging a bad situation?
- Retrying is a keystone of resilience. But there are dragons
- Exponential backoff can help a struggling service recover by having clients wait longer and longer between successive retries (see the sketch after this list)
- There's some talk of circuit breakers, but it didn't sound particularly favourable: they add a distinct mode to the system, which makes testing more challenging
- Think about max retries + error reporting once the retry budget is exhausted
- Jitter can help quite a bit, and not just for retries: adding a tiny bit of random delay (+/-) to the initial arrival of work can smooth out excessive load (second sketch below)
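To tie the retry ideas together, here's a minimal sketch of capped exponential backoff with full jitter, a max attempt count, and no retries on 4xx. The names and constants (get_with_retries, MAX_ATTEMPTS, BASE_DELAY, MAX_DELAY, the 1-second timeout) are mine, not the article's, and this is only safe because GET is idempotent:

```python
import random
import time

import requests

MAX_ATTEMPTS = 3   # cap attempts, then give up and report the error
BASE_DELAY = 0.1   # seconds
MAX_DELAY = 5.0    # bound the backoff so waits stay reasonable

def get_with_retries(url: str) -> requests.Response:
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = requests.get(url, timeout=1.0)  # percentile-derived timeout
        except requests.RequestException:
            resp = None  # timeout or connection error: worth retrying
        else:
            if resp.status_code < 500:
                # 2xx succeeded; 4xx is a client error that will never
                # succeed, so retrying would just tie up resources
                return resp
        if attempt + 1 < MAX_ATTEMPTS:
            # capped exponential backoff with "full jitter": sleep a random
            # amount in [0, min(MAX_DELAY, BASE_DELAY * 2**attempt)]
            time.sleep(random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt)))
    raise RuntimeError(f"{url} still failing after {MAX_ATTEMPTS} attempts")
```

Full jitter (sleeping a random amount between zero and the backoff ceiling) spreads retries out the most, at the cost of some retries happening almost immediately.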
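And the arrival-side jitter point, sketched for a fleet of workers on a fixed schedule (the names and the 10% figure are illustrative assumptions):

```python
import random
import time

INTERVAL = 60.0  # seconds between runs of some periodic task

def run_periodically(task) -> None:
    while True:
        task()
        # +/- 10% random delay: workers started at the same moment drift
        # apart instead of hitting a dependency in lockstep forever
        time.sleep(INTERVAL + random.uniform(-0.1 * INTERVAL, 0.1 * INTERVAL))
```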
Other good concerns
- Retries between layers amplify, e.g. controller > service > data access > external API call > … If each layer adds 3 retries, the work may stay in the system and be responsible for dozens or even hundreds of calls. Something to keep in mind (sketch below)
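A quick sketch of that arithmetic (constants illustrative): with 3 attempts per layer, the worst case grows exponentially with depth.

```python
ATTEMPTS_PER_LAYER = 3  # 1 initial call + 2 retries at each layer

def worst_case_calls(layers: int) -> int:
    """Worst case: every call fails and every layer exhausts its retries."""
    return ATTEMPTS_PER_LAYER ** layers

for n in range(1, 5):
    print(f"{n} layer(s): up to {worst_case_calls(n)} calls at the bottom")
# 1 layer(s): up to 3 calls at the bottom
# 2 layer(s): up to 9 calls at the bottom
# 3 layer(s): up to 27 calls at the bottom
# 4 layer(s): up to 81 calls at the bottom
```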