- Incident Review for Site-wide Outage for GitLab.com - Stale Terraform Pipeline: Never run an old plan against an environment
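  A minimal sketch of the kind of guard this lesson points at, assuming plans are saved with `terraform plan -out=tfplan` and applied from that file; the file name, the 30-minute threshold, and the Python wrapper are all made up for illustration, not anything from the GitLab writeup:

  ```python
  #!/usr/bin/env python3
  """Refuse to apply a Terraform plan file that has gone stale.

  Hypothetical guard: the plan file name and the staleness threshold
  are assumptions for illustration only.
  """
  import os
  import subprocess
  import sys
  import time

  PLAN_FILE = "tfplan"            # created earlier by `terraform plan -out=tfplan`
  MAX_PLAN_AGE_SECONDS = 30 * 60  # arbitrary cutoff: anything older counts as "old"


  def plan_age_seconds(path: str) -> float:
      """How long ago the saved plan was written, based on file mtime."""
      return time.time() - os.path.getmtime(path)


  def main() -> int:
      if not os.path.exists(PLAN_FILE):
          print(f"No plan file at {PLAN_FILE}; run `terraform plan -out={PLAN_FILE}` first.")
          return 1

      age = plan_age_seconds(PLAN_FILE)
      if age > MAX_PLAN_AGE_SECONDS:
          print(f"Refusing to apply: plan is {age / 60:.0f} minutes old. Re-plan against current state.")
          return 1

      # Plan is fresh enough: apply exactly the saved plan, nothing else.
      return subprocess.run(["terraform", "apply", PLAN_FILE]).returncode


  if __name__ == "__main__":
      sys.exit(main())
  ```

  (File age is a crude proxy; the real point is that a plan reflects the state of the world at the moment it was generated, so re-planning is always the safer move.)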
- BitTorrent v2: Technical description of BitTorrent v2. Good bits around the improvements that were made over v1. These kinds of changes probably have to be made over time as you learn about a system. You probably don’t know up front where the opportunities for making things better are going to be. (Or the pain points.) You can try to guess, but if you’re doing something you haven’t done before, that’ll be tricky. Heck, it’s tricky even if you do have direct experience. :)
- Cloudflare incident on October 30, 2023: Walkthrough of a very recent Cloudflare Workers outage. Language is good - transparency, apology, learning, improvement. My instinctive takeaways, I guess, are that automation is hard and that complexity doesn’t make it any easier. (There’s much to be said for keeping systems as small and simple as possible. But no simpler.)
- A guide to running Incident Command: Incident commander is an important role during a production event, with responsibilities like the ones called out here:
- Sounding the alarm
- Gathering the response team
- Determining severity (and that severity can change as you learn about an event!)
- The commander role itself can change depending on the severity of an event
- Communication / response team coordination
- My process when starting a new job: Good advice here for starting with a new team / job:
- Ask your manager who you should talk to
- Learn about goals and be specific
- Keep a note about terms that are new
- Document how people work (what they say they do + what they actually do)
- Look for opportunities to make things better (don’t necessarily do it, but be aware and be ready to help when + where it makes sense)
- Don’t get stressed out - it’s not all on you
- Over-communicate what you’re doing and get feedback