- FOQS: Making a distributed priority queue disaster-ready: Evolution of a queue service through different levels of scale and failure modes.
- Thinking about building reliability now that I’m reading a blog post from Delivery Hero. What is my current list?
- Develop, build, test, deploy, operate, support
- On-call
- Simple, boring, as few processes as necessary*
- Consider availability. How much is needed? Single server, multi-AZ, and multi-region are all valid options depending on service criticality.
- Monitored
- Errors
- Logs should be structured and easy to parse and analyze. They should also be collected centrally and archived for a retention period decided by the business.
- Performance (Simulating production traffic is hard. Pick something easy to get started with and go from there)
- Metrics
- A dashboard per service with key metrics on it relevant to what the component does in the system
- Backed up with backup verification
- Identify regular operational tasks and automate what you can
- Documented, even if only with basic high-level architecture diagrams (What does the data flow look like?)
- Build pipelines
- Reproducible builds
- Freeze the bits between environments
- Integration tests > unit tests > no tests :) (It’s ok if tests need to touch the file system or db.)
- Tests are run with every check-in
- If the build is broken, people stop what they’re doing and fix it because it slows us all down
- People are checking in to trunk and integrating with each other every day
- Deploy pipelines with promotion process including a pre-prod environment + rollback
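
To make the "structured, easy to parse" logging item concrete, here is a minimal sketch using only Python's standard `logging` module. It writes one JSON object per line, which most central log collectors can ingest without custom parsing. The field names (`ts`, `level`, `logger`, `message`, `context`), the `build_logger` helper, and the convention of passing `extra={"context": {...}}` are my own assumptions for illustration, not something taken from a particular service.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers attach structured fields via
        # extra={"context": {...}}; logging copies them onto the record.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps the formatter from failing on non-JSON types.
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper: a logger that emits JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Usage: write to an in-memory buffer, then parse the line back.
buffer = io.StringIO()
log = build_logger("orders", buffer)
log.info("order accepted", extra={"context": {"order_id": 42, "region": "eu-west-1"}})
record = json.loads(buffer.getvalue())
```

In a real service the stream would likely be stdout, with a collector shipping lines to central storage. Retention and archiving are handled there, not in the application.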