Disaster recovery testing @ Booking.com

YouTube
Booking.com's first disaster recovery test: evacuate traffic out of a region. They started by getting everybody together and asking how to do it and what would happen.
Failover testing
Goal 1: Full business continuity with the loss of a region
Challenges
- Risk
- Cost
- Buy-in
- Investment
Growing the program
- Roadmapping
- Wider communication across teams
- Visibility
- Risk management
- Cross-functional collaboration, from management to frontline workers
We’re “inducing an outage”. Everybody should understand this!
The first time this happened it took lots of manual work, people, and time. Afterwards there is a postmortem. Issues are fixed. Services are made more similar in terms of safety controls. A subsequent test is scheduled with greater risk.
Ongoing process of training the broader team. Awareness of runbooks - where they are and how to use them.
Different degrees of “small” incidents
- Network isolation (some risk)
- Region isolation (more risk)
- DC power failure (lots of risk; aborting the test is hard. Once you pull the plug you have to turn it all back on, and restarting things can be tricky)

What got you here won’t get you there

Source
Reliability is a lagging indicator in some ways
- What happened or didn’t has already occurred
- It’s in the past
- The user is either still with you, or got frustrated and left
How to measure? SREs use things like SLIs and SLOs
- Is this satisfying?
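For reference, here is a minimal sketch of what the SLI/SLO style of measurement looks like. It is not from the talk: the function names, the request counts, and the 99.9% target are all made-up assumptions for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests in the window that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0.0 is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_requests  # failures the SLO tolerates
    actual_bad = total_requests - good_requests         # failures that actually happened
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    good, total = 999_500, 1_000_000  # hypothetical 30-day request counts
    target = 0.999                    # hypothetical SLO: 99.9% of requests succeed
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Error budget remaining: {error_budget_remaining(good, total, target):.1%}")
```

Both numbers describe a window that has already closed, which is the lagging-indicator point above: they say how many requests failed, not whether the people behind them stayed.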
Less quantifiable measures, but these feel better
- On-call engineer responds to and mitigates an incident. Did their action help?
- Team manager holds weekly production review meetings. Are problems starting to creep in?
- Customer success asks a customer whether they are having any problems
Source
- Break Free of the Template: Incident Writeups They Want to Read
- Be readable
- Don’t write a sales pitch
- Use language that everyone can read (eliminate technical jargon)
- Keep it light if you can (but don’t poke fun at a severe outage that impacted lots of your customers)
- Incident reviews will be more memorable and findable if they have interesting titles. Bad: “Apache httpd restart Sep 1 2022”. Maybe better: “Apache worker death spiral …”