Links Facebook regional queue service

FOQS: Making a distributed priority queue disaster-ready: Evolution of a queue service through different levels of scale and failure modes.
Thinking about building reliability now that I’m reading a blog post from Delivery Hero. What is my current list?
- Develop, build, test, deploy, operate, support
- On-call
- Simple, boring, as few processes as necessary*
- Consider availability. How much is needed? Single server, multi-az, multi-region are all valid options depending on service criticality.
- Monitored
  - Errors
  - Logs should be structured, easy to parse and analyze. They’re also collected centrally and archived for some period of retention to be decided by the business.
  - Performance (Simulating production traffic is hard. Pick something easy to get started with and go from there)
  - Metrics
  - A dashboard per service with key metrics on it relevant to what the component does in the system
- Backed up with backup verification
- Identify regular operational tasks and automate what you can
- Documented with even basic high level architecture diagrams (What’s the data flow look like?)
- Build pipelines
  - Reproducible builds
  - Freeze the bits between environments
  - Integration tests > unit tests > no tests :) (It’s ok if tests need to touch the file system or db.)
  - Tests are run with every checkin
  - If the build is broken, people stop what they’re doing and fix it because it slows us all down
  - People are checking in to trunk and integrating with eachother every day
- Deploy pipelines with promotion process including a pre-prod environment + rollback