• FOQS: Making a distributed priority queue disaster-ready: Evolution of a queue service through different levels of scale and failure modes.
  • Reading a blog post from Delivery Hero has me thinking about building reliability. What is my current list?
    • Develop, build, test, deploy, operate, support
    • On-call
    • Simple, boring, as few processes as necessary*
    • Consider availability. How much is needed? Single server, multi-AZ, and multi-region are all valid options depending on service criticality.
    • Monitored
      • Errors
      • Logs should be structured and easy to parse and analyze. They’re also collected centrally and archived for a retention period set by the business.
      • Performance (Simulating production traffic is hard. Pick something easy to get started with and go from there)
      • Metrics
      • A dashboard per service with key metrics on it relevant to what the component does in the system
    • Backed up with backup verification
    • Identify regular operational tasks and automate what you can
    • Documented, even if only with basic high-level architecture diagrams (What does the data flow look like?)
    • Build pipelines
      • Reproducible builds
      • Freeze the bits between environments
      • Integration tests > unit tests > no tests :) (It’s OK if tests need to touch the file system or DB.)
      • Tests are run with every check-in
      • If the build is broken, people stop what they’re doing and fix it because it slows us all down
      • People are checking in to trunk and integrating with each other every day
    • Deploy pipelines with a promotion process, including a pre-prod environment and rollback
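
To make the "structured, easy to parse" logging item on the list concrete, here is a minimal sketch using only Python's standard `logging` and `json` modules. It emits one JSON object per line, which is a shape most central log collectors can ingest. The `JsonFormatter` class, the `make_logger` helper, and the `extra={"fields": {...}}` convention are my own assumptions for illustration, not something from the Delivery Hero post.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"fields": {...}} to
        # attach structured key/value context to a log line.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper: build a logger that writes JSON lines to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger


if __name__ == "__main__":
    buf = io.StringIO()
    log = make_logger("orders", buf)
    log.info("order placed", extra={"fields": {"order_id": 42}})
    print(buf.getvalue(), end="")
```

In a real service the stream would be stdout or a file that a log shipper picks up; collection, archiving, and the retention period are decided outside the application.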