Disaster recovery testing @ Booking.com
- YouTube
- Booking.com's first disaster recovery test: evacuate traffic out of a region. They started by getting everybody together and asking how to do it and what would happen.
- Failover testing
- Goal 1: Full business continuity with the loss of a region
- Challenges
- Risk
- Cost
- Buy-in
- Investment
- Growing the program
- Roadmapping
- Wider communication across teams
- Visibility
- Risk management
- Cross-functional collaboration, from management to frontline workers
- We’re “inducing an outage”. Everybody should understand this!
- The first time this happened it took lots of manual work, people, and time. Afterwards there is a postmortem. Issues are fixed. Services are made more similar in terms of safety controls. A subsequent test is scheduled with greater risk.
- Ongoing process of training the broader team. Awareness of runbooks - where they are and how to use them.
- Different degrees of “small” incidents
- Network isolation (some risk)
- Region isolation (more risk)
- DC power failure (lots of risk; aborting the test is hard. Once you pull the plug you have to turn it all back on, and restarting things can be tricky)
What got you here won’t get you there
- Source
- Reliability is a lagging indicator in some ways
- What happened or didn’t has already occurred
- It’s in the past
- The user is either still with you, or got frustrated and left
- How to measure? SREs use things like SLIs and SLOs
- Less quantifiable measures, but these feel better
- Oncall engineer responds to and mitigates an incident. Did their action help?
- Team manager holds weekly production review meetings. Are problems starting to creep in?
- Customer success asks a customer whether they are having any problems.
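To make the SLI/SLO bullet concrete, here is a minimal sketch of how an SLI and the remaining error budget might be computed from request counts. The function name, field names, and the 99.9% target are illustrative assumptions, not something taken from the source. Note that it can only be computed after the requests have already succeeded or failed, which is the "lagging indicator" point above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # fraction of good requests actually observed
    slo_target: float        # target fraction of good requests, e.g. 0.999
    budget_remaining: float  # share of the error budget left; negative means overspent


def evaluate_slo(good: int, total: int, slo_target: float) -> SloReport:
    """Compare an observed success ratio (SLI) against a target (SLO).

    Hypothetical helper for illustration only.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good               # failures that already happened
    budget_remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli, slo_target, budget_remaining)


if __name__ == "__main__":
    # Assumed numbers: 100,000 requests in the window, 50 of them failed.
    report = evaluate_slo(good=99_950, total=100_000, slo_target=0.999)
    print(f"SLI={report.sli:.4%}, budget remaining={report.budget_remaining:.0%}")
```

A number like this says nothing about whether the on-call engineer's action helped or whether a customer is quietly unhappy, which is why the less quantifiable measures in the list still matter.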
- Source
- Break Free of the Template: Incident Writeups They Want to Read
- Be readable
- Don’t write a sales pitch
- Use language that everyone can read (eliminate technical jargon)
- Keep it light if you can (but don’t poke fun at a severe outage that impacted lots of your customers)
- Incident reviews will be more memorable and findable if they have interesting titles. Bad: “Apache httpd restart Sep 1 2022”. Maybe better: “Apache worker death spiral …”