- On-call with Jérôme Petazzoni: My most recent memory of how Jérôme would do on-call:
- He’d be the primary on-call 2 days every week or so
- Pair rotation. There’d be a junior and senior team member on-call at once
- The junior would try to tackle problems during the day; the senior would handle everything else
- Pages would be acknowledged within 15-30m
- If a fix would take upwards of an hour, he’d get other team members involved; otherwise he’d try to handle an incident on his own (This sounds high. I probably missed something here.)
- Tabletop exercises: reboot a server, fill up a VM disk, change a public IP (Nasty lol)
- Dev tools time: Suz Hinton: Loved this one. Tools Suz uses a lot. I should check them out:
- Excellent interview with Neil Gaiman from August 2022: Talked about The Sandman (the Netflix series) and Good Omens 2 (The first series on Amazon Prime Video was so good :))
- NYT platform team sends alerts to users of a multi-tenant Kafka service: When a team that needed Kafka onboarded, they would choose the Slack channels to send alerts to, the criticality of alerts for a particular consumer, and the threshold at which they wanted to be alerted or paged (the key metric being consumer lag, i.e. the number of messages on a Kafka topic not yet processed). There was also a neat flow diagram:
- Google form
- Google Apps Script that pushes form data to a backend service on submit
- Create a pull request on infra repo
- Platform team member reviews the PR
- Accepts and merges to a prod branch in the infra repo
- Terraform pushes out the change (in this case, the change is an auto-generated description of a Datadog monitor config that sets up Slack notification rules based on the Google Form data)
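
To make the generation step in that flow concrete for myself, here is a minimal Python sketch of how a backend service might turn one form submission into a Terraform JSON fragment for a Datadog monitor. This is my own guess at the shape, not what the NYT team actually runs. The form field names (`consumer_group`, `lag_threshold`, `slack_channels`, `criticality`), the function name `build_lag_monitor`, the 5-minute window, and the use of the `kafka.consumer_lag` metric with a `consumer_group` tag are all assumptions for illustration.

```python
import json


def build_lag_monitor(form: dict) -> dict:
    """Build a Terraform JSON fragment for one consumer-lag monitor.

    `form` is a hypothetical dict of onboarding form answers.
    """
    group = form["consumer_group"]
    threshold = int(form["lag_threshold"])

    # Datadog routes monitor notifications to Slack through
    # @slack-<channel> handles placed in the monitor message.
    handles = " ".join(
        f"@slack-{channel.lstrip('#')}" for channel in form["slack_channels"]
    )

    # Assumed metric and tag names; adjust to whatever the cluster reports.
    query = (
        f"avg(last_5m):max:kafka.consumer_lag{{consumer_group:{group}}}"
        f" > {threshold}"
    )

    # Terraform resource names cannot contain every character a consumer
    # group name can, so hyphens are normalised to underscores here.
    resource_name = "lag_" + group.replace("-", "_")

    return {
        "resource": {
            "datadog_monitor": {
                resource_name: {
                    "name": f"Kafka consumer lag: {group}",
                    "type": "metric alert",
                    "query": query,
                    "message": f"Consumer lag above {threshold} messages. {handles}",
                    "tags": [f"criticality:{form['criticality']}"],
                    "monitor_thresholds": {"critical": threshold},
                }
            }
        }
    }


# Example submission with made-up values.
example_form = {
    "consumer_group": "orders-worker",
    "lag_threshold": "1000",
    "slack_channels": ["#team-orders"],
    "criticality": "high",
}

print(json.dumps(build_lag_monitor(example_form), indent=2))
```

The printed JSON is only the monitor fragment; provider configuration is left out. In the flow above, something like this output would be what lands in the pull request for a platform team member to review before Terraform applies it.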